**Springer Actuarial**

# Mario V. Wüthrich Michael Merz

# Statistical Foundations of Actuarial Learning and its Applications

# **Springer Actuarial**

## **Editors-in-Chief**

Hansjoerg Albrecher, University of Lausanne, Lausanne, Switzerland

Michael Sherris, UNSW, Sydney, NSW, Australia

## **Series Editors**

Daniel Bauer, University of Wisconsin-Madison, Madison, WI, USA

Stéphane Loisel, ISFA, Université Lyon 1, Lyon, France

Alexander J. McNeil, University of York, York, UK

Antoon Pelsser, Maastricht University, Maastricht, The Netherlands

Ermanno Pitacco, Università di Trieste, Trieste, Italy

Gordon Willmot, University of Waterloo, Waterloo, ON, Canada

Hailiang Yang, The University of Hong Kong, Hong Kong, Hong Kong

This is a series on actuarial topics in a broad and interdisciplinary sense, aimed at students, academics and practitioners in the fields of insurance and finance.

Springer Actuarial informs timely on theoretical and practical aspects of topics like risk management, internal models, solvency, asset-liability management, market-consistent valuation, the actuarial control cycle, insurance and financial mathematics, and other related interdisciplinary areas.

The series aims to serve as a primary scientific reference for education, research, development and model validation.

The type of material considered for publication includes lecture notes, monographs and textbooks. All submissions will be peer-reviewed.

# Statistical Foundations of Actuarial Learning and its Applications

Mario V. Wüthrich, Department of Mathematics, RiskLab Switzerland, ETH Zürich, Zürich, Switzerland

Michael Merz, Faculty of Business Administration, University of Hamburg, Hamburg, Germany

This work was supported by Schweizerische Aktuarvereinigung SAV and Swiss Re.

ISSN 2523-3262 ISSN 2523-3270 (electronic)
Springer Actuarial
ISBN 978-3-031-12408-2 ISBN 978-3-031-12409-9 (eBook)
https://doi.org/10.1007/978-3-031-12409-9

Mathematics Subject Classification: C13, C21/31, C24/34, G22, 62F10, 62F12, 62J07, 62J12, 62M45, 62P05, 68T01, 68T50

© The Authors 2023. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

# **Acknowledgments**

We kindly thank our very generous sponsors, the Swiss Association of Actuaries (SAA) and Swiss Re, for financing the open access option of the electronic version of this book. Our special thanks go to Sabine Betz (President of SAA), Adrian Kolly (Swiss Re), and Holger Walz (SAA) who were very positive and interested in this book project from the very beginning, and who made this open access funding possible within their institutions.

A very special thank you goes to Hans Bühlmann who has been supporting us over the last 30 years. We have had so many inspiring discussions over these years, and we have greatly benefited and learned from Hans' incredible knowledge and intuition.

Jointly with Christoph Buser, we have started to teach the lecture "Data Analytics for Non-Life Insurance Pricing" at ETH Zurich in 2018. Our data analytics lecture focuses (only) on the Poisson claim counts case, but its lecture notes have provided a first draft for this book project. This draft has been developed and extended to the general case of the exponential family. Since our first lecture, we have greatly benefited from interactions with many colleagues and students. In particular, we would like to mention the data science initiative "Actuarial Data Science" of the Swiss Association of Actuaries (chaired by Jürg Schelldorfer), whose tutorials provided a great stimulus for this book. Moreover, we mention the annual Insurance Data Science Conference (chaired by Markus Gesmann and Andreas Tsanakas) and the ASTIN Reading Club (chaired by Ronald Richman and Dimitri Semenovich). Furthermore, we would like to kindly thank Ronald Richman who has always been a driving force behind learning and adapting new machine learning techniques, and we also kindly thank Simon Rentzmann for many interesting discussions on how to apply these techniques on real insurance problems.

We thank the following colleagues by name (in alphabetical order). We collaborated and had inspiring discussions in the field of statistical learning with the following colleagues: Johannes Abegglen, Hansjörg Albrecher, Davide Apolloni, Peter Bühlmann, Christoph Buser, Patrick Cheridito, Łukasz Delong, Paul Embrechts, Andrea Ferrario, Tobias Fissler, Luca Fontana, Daisuke Frei, Tsz Chai Fung, Guangyuan Gao, Yan-Xing Lan, Gee Lee, Mathias Lindholm, Christian Lorentzen, Friedrich Loser, Michael Mayer, Daniel Meier, Alexander Noll, Gareth Peters, Jan Rabenseifner, Peter Reinhard, Simon Rentzmann, Ronald Richman, Ludger Rüschendorf, Robert Salzmann, Marc Sarbach, Jürg Schelldorfer, Pavel Shevchenko, Joël Thomann, Andreas Tsanakas, George Tzougas, Emiliano Valdez, Tim Verdonck, and Patrick Zöchbauer.

# **Contents**







# **Chapter 1 Introduction**

## **1.1 The Statistical Modeling Cycle**

We consider statistical modeling of insurance problems. This comprises the process of data collection, data analysis and statistical model building to forecast insured events that (may) happen in the future. This problem lies at the very heart of statistics and statistical modeling. Our goal here is to present and provide the statistical tools that are useful in daily actuarial practice; in particular, we aim at describing the mathematical foundation behind these statistical concepts and showing how they can be applied. Statistical modeling has a wide range of applications and, depending on the application, the theoretical aspects may be weighted differently. In insurance pricing we are mainly interested in optimal predictions, whereas economists often use statistical tools to explain observations, and in medical fields one is interested in the causal effects that medications have on patients. Statistical theory is therefore wide-ranging, and one should always keep the corresponding application in mind. Shmueli [338] nicely discusses the difference between prediction and explanation; our focus here is mainly on prediction.

Box–Jenkins [49] and McCullagh–Nelder [265] distinguish three processes in statistical modeling: (i) model identification/selection, (ii) estimation, and (iii) prediction. In our statistical modeling cycle these three points are slightly modified and extended:

(1) Data collection, cleaning and pre-processing:

This item takes at least 80% of the total time in statistical modeling. It includes exploratory data analysis, data visualization and data pre-processing. This part of the modeling cycle does not seem to be very scientific; however, it is a highly important step, because only an extended data analysis allows the modeler to fully understand the data. Based on this knowledge the modeler can formulate her/his research question, her/his model, etc.

(2) Selection of a model class:

Based on the knowledge collected in the first item, the modeler has to select a suitable model class that is able to answer her/his research question. This model class can be in the sense of a data model (proper stochastic model), but it can also be an algorithmic model; we refer to the discussion on the "two modeling cultures" by Breiman [53].


(5) Model validation:

In this final step, the selected and fitted model needs to be validated: does the model fit the data, does it serve at predicting new data, does it answer the research question adequately, is there any better model/process choice, etc.?

(6) Possibly go back to (1):

If the answers in item (5) are not satisfactory, one typically goes back to (1). For instance, data pre-processing needs to be done differently, etc.

In particular, the two modeling cultures discussion of Breiman [53], shortly after the turn of the millennium, has shaken up the statistical community. With predictive performance as the main criterion, the data modeling culture has gradually shifted towards the algorithmic culture, where the model itself plays a secondary role as long as the predictions are accurate; the latter often come in the form of point predictors produced by an algorithm. Lifting this discussion to a more scientific level by also providing prediction uncertainty will slowly merge the two modeling cultures. There is another interesting discussion by Efron [116] on prediction, estimation (of model parameters) and attribution (predictor selection) that is very much at the core of statistical modeling. In these notes we especially want to emphasize the one modeling culture view of Yu–Barter [397], who expect the two modeling cultures of Breiman [53] to merge much more closely than one might expect. Our goal is to demonstrate how all these different techniques and views can be seen as a unified modeling framework.

Concluding, the purpose of these notes is to discuss and illustrate how the different statistical techniques from the data modeling culture and the algorithmic modeling culture can be combined to solve actuarial questions in the best possible way. The main emphasis in this discussion lies on the statistical modeling tools, and we present these tools along with actuarial examples. In actuarial practice one often distinguishes between life and general insurance. This distinction is made for good reasons: there are legislative reasons that require life and general insurance business to be legally separated, but there are also modeling reasons, because insurance products in life and general insurance can have rather different features. In this book we do not make this distinction, because the statistical methods presented here can be useful in both branches of insurance, and we are going to consider both life and general insurance examples, e.g., mortality forecasting for the former and insurance claims prediction for pricing in the latter.

## **1.2 Preliminaries on Probability Theory**

The modern axiomatic foundation of probability theory was introduced in 1933 by the famous mathematician Kolmogoroff [221] in his book called "Grundbegriffe der Wahrscheinlichkeitsrechnung". We give a brief introduction to probability theory and random variables; this introduction follows the lecture notes [387]. Throughout we assume to work on a sufficiently rich probability space $(\Omega, \mathcal{A}, \mathbb{P})$, meaning that this probability space should be able to carry all objects that we study. We denote (real-valued) random variables on this probability space by capital letters $Y, Z, \ldots$, and random vectors use boldface capital letters, e.g., we have a random vector $\boldsymbol{Y} = (Y_1, \ldots, Y_q)$ of dimension $q \in \mathbb{N}$, where each component $Y_k$, $1 \le k \le q$, is a random variable. Random variables $Y$ are characterized by (cumulative) distribution functions^1 $F : \mathbb{R} \to [0,1]$, for $y \in \mathbb{R}$

$$F(y) = \mathbb{P}\left[Y \le y\right],$$

being the probability of the event that $Y$ has a realization less than or equal to $y$. We write $Y \sim F$ for $Y$ having distribution function $F$. Similarly, random vectors $\boldsymbol{Y} \sim F$ are characterized by (cumulative) distribution functions $F : \mathbb{R}^q \to [0,1]$ with

$$F(\boldsymbol{y}) = \mathbb{P}\left[Y_1 \le y_1, \ldots, Y_q \le y_q\right] \qquad \text{for } \boldsymbol{y} = (y_1, \ldots, y_q)^\top \in \mathbb{R}^q.$$

In insurance modeling, there are two important types of random variables, namely, discrete random variables and absolutely continuous random variables:

• The distribution function $F$ of a discrete random variable $Y$ is a step function with countably many steps in discrete points $k \in \mathfrak{N} \subset \mathbb{R}$. A discrete random variable has probability weights in these discrete points

$$f(k) = \mathbb{P}\left[Y = k\right] > 0 \qquad \text{for } k \in \mathfrak{N},$$

^1 Cumulative distribution functions $F$ are right-continuous and non-decreasing with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.

satisfying $\sum_{k \in \mathfrak{N}} f(k) = 1$. If $\mathfrak{N} \subseteq \mathbb{N}_0$, the integer-valued random variable $Y$ is called a count random variable. Count random variables are used to model the number of claims in insurance. A similar situation occurs if $Y$ models nominal outcomes, for instance, if $Y$ models gender with female being encoded by 0 and male being encoded by 1, then $f(0)$ is the probability weight of having a female and $f(1) = 1 - f(0)$ the probability weight of having a male; in this case we identify the finite set $\mathfrak{N} = \{0, 1\} = \{\text{female}, \text{male}\}$.
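As a small illustration of probability weights of a count random variable, the following Python sketch evaluates the Poisson weights $f(k) = e^{-\lambda} \lambda^k / k!$ on $\mathfrak{N} = \mathbb{N}_0$ (the choice of the Poisson distribution and of the parameter value is ours, purely for illustration):

```python
from math import exp, factorial

def poisson_weight(k: int, lam: float) -> float:
    """Probability weight f(k) = P[Y = k] of a Poisson(lam) count random variable."""
    return exp(-lam) * lam ** k / factorial(k)

# the weights over N_0 sum to 1; we check the sum up to a truncation point
lam = 1.5
total = sum(poisson_weight(k, lam) for k in range(50))
```

The truncated sum `total` is numerically indistinguishable from 1, reflecting $\sum_{k \in \mathfrak{N}} f(k) = 1$.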

• A random variable $Y \sim F$ is said to be absolutely continuous^2 if there exists a non-negative (measurable) function $f$, called density of $Y$, such that

$$F(y) = \int_{-\infty}^{y} f(x) \, dx \qquad \text{for all } y \in \mathbb{R}.$$

In that case we equivalently write *Y* ∼ *f* and *Y* ∼ *F*. Absolutely continuous random variables are often used to model claim sizes in insurance.

More generally speaking, discrete and absolutely continuous random variables have densities $f(\cdot)$ w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$. In the former case, this $\sigma$-finite measure $\nu$ is the counting measure on $\mathfrak{N} \subset \mathbb{R}$, and in the latter case it is the Lebesgue measure on $\mathbb{R}$. In actuarial science we also consider mixed cases, for instance, Tweedie's compound Poisson random variable is absolutely continuous on $(0, \infty)$ and has an additional point mass in 0; this model will be studied in Sect. 2.2.3, below.
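The mixed case can be made concrete with a small simulation: a compound Poisson sum with gamma severities has a point mass at 0 of size $e^{-\lambda}$ and is absolutely continuous on $(0, \infty)$. The Python sketch below (all parameter values are our illustrative choices, not taken from the book) draws from such a variable:

```python
import random
from math import exp

def tweedie_sample(lam: float, shape: float, scale: float, rng: random.Random) -> float:
    """One draw of a compound Poisson-gamma (Tweedie-type) variable:
    N ~ Poisson(lam) claim counts, each claim with an independent gamma severity."""
    # Poisson draw by inversion of the cdf (fine for moderate lam)
    n, p, u = 0, exp(-lam), rng.random()
    cdf = p
    while u > cdf:
        n += 1
        p *= lam / n
        cdf += p
    return sum(rng.gammavariate(shape, scale) for _ in range(n))

rng = random.Random(42)
sample = [tweedie_sample(0.5, 2.0, 1000.0, rng) for _ in range(10_000)]
# empirical point mass at 0 should be close to exp(-0.5) ~ 0.607
frac_zero = sum(s == 0.0 for s in sample) / len(sample)
```

Roughly 61% of the draws are exactly 0 (no claims in the period), while the positive part is continuously distributed, mirroring the point mass plus density structure described above.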

Choose a random variable $Y \sim F$ and a measurable function $h : \mathbb{R} \to \mathbb{R}$. The expected value of $h(Y)$ is defined by (upon existence)

$$\mathbb{E}\left[h(Y)\right] = \int_{\mathbb{R}} h(y) \, dF(y).$$

We mainly focus on the following important examples of the function $h$:

• expected value, mean or first moment of $Y \sim F$: for $h(y) = y$

$$\mu = \mathbb{E}\left[Y\right] = \int_{\mathbb{R}} y \, dF(y);$$

• $k$-th moment of $Y \sim F$ for $k \in \mathbb{N}$: for $h(y) = y^k$

$$\mathbb{E}\left[Y^k\right] = \int_{\mathbb{R}} y^k \, dF(y);$$

^2 Absolute continuity is a stronger property than continuity.


• moment generating function of $Y \sim F$ in $r \in \mathbb{R}$: for $h(y) = e^{ry}$

$$M_Y(r) = \mathbb{E}\left[e^{rY}\right] = \int_{\mathbb{R}} e^{ry} \, dF(y);$$

always subject to existence.

The moment generating function $M_Y(\cdot)$ is sufficient for identifying distribution functions of random variables $Y$. The following statements are elementary and their proofs are based on Section 30 of Billingsley [34]; for more details we also refer to Chapter 1 of the lecture notes [387]. Assume that the moment generating function of $Y \sim F$ has a strictly positive radius of convergence $\rho_0 > 0$ around the origin, implying that $M_Y(r) < \infty$ for all $r \in (-\rho_0, \rho_0)$. In this case we can write $M_Y(r)$ as a power series expansion

$$M\_Y(r) = \sum\_{k=0}^{\infty} \frac{r^k}{k!} \mathbb{E}\left[Y^k\right] \qquad \text{ for all } r \in (-\rho\_0, \rho\_0).$$

As a consequence, we can differentiate $M_Y(\cdot)$ arbitrarily often in the open interval $(-\rho_0, \rho_0)$, term by term under the sum. The derivatives in $r = 0$ provide the $k$-th moments (which all exist and are finite)

$$\frac{d^k}{dr^k} \, M\_Y(r)|\_{r=0} = \mathbb{E}\left[Y^k\right] \qquad \text{for all } k \in \mathbb{N}\_0. \tag{1.1}$$

In particular, in this case we immediately know that all moments of $Y$ exist, and these moments completely determine the moment generating function $M_Y$ of $Y$. Another consequence is that for a random variable $Y$, whose moment generating function $M_Y$ has a strictly positive radius of convergence around the origin, the distribution function $F$ is fully determined by this moment generating function. That is, if we have two such random variables $Y_1$ and $Y_2$ with $M_{Y_1}(r) = M_{Y_2}(r)$ for all $r \in (-r_0, r_0)$, for some $r_0 > 0$, then $Y_1 \stackrel{(d)}{=} Y_2$,^3 i.e., these two random variables have the same distribution function. This statement carries over to the limit: if we have a sequence of random variables $(Y_n)_n$ whose moment generating functions converge on a common interval $(-r_0, r_0)$, for some $r_0 > 0$, to the moment generating function of $Y$, also being finite on $(-r_0, r_0)$, then $(Y_n)_n$ converges in distribution to $Y$; such an argument is used to prove the central limit theorem (CLT).

^3 The notation $Y_1 \stackrel{(d)}{=} Y_2$ is generally used for equality in distribution, meaning that $Y_1$ and $Y_2$ have the same distribution function.
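The power series expansion of the moment generating function can be checked numerically on a concrete light-tailed example: an Exponential(1) random variable has $M_Y(r) = 1/(1-r)$ for $r < 1$ and moments $\mathbb{E}[Y^k] = k!$. A short Python sketch (the choice of distribution and of $r$ is ours, for illustration only):

```python
from math import factorial

def mgf_exponential(r: float) -> float:
    """Closed-form mgf of an Exponential(1) variable: M_Y(r) = 1/(1-r) for r < 1."""
    return 1.0 / (1.0 - r)

def mgf_series(r: float, terms: int) -> float:
    """Truncated power series sum_k r^k / k! * E[Y^k], with E[Y^k] = k! here."""
    return sum(r ** k / factorial(k) * factorial(k) for k in range(terms))

r = 0.3
approx, exact = mgf_series(r, 60), mgf_exponential(r)
```

For $r = 0.3$, well inside the radius of convergence $\rho_0 = 1$, the truncated series agrees with the closed form to floating-point precision, illustrating that the moments fully determine $M_Y$ on $(-\rho_0, \rho_0)$.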

In insurance, we often deal with so-called positive random variables $Y$, meaning that $Y \ge 0$, almost surely (a.s.). In that case, the statements about moment generating functions and distributions hold true without the assumption of a positive radius of convergence around the origin, see Theorem 22.2 in Billingsley [34]. Note that for positive random variables the moment generating function $M_Y(r)$ exists for all $r \le 0$.

Existence of the moment generating function $M_Y(r)$ for some positive $r > 0$ can also be interpreted as having a light-tailed distribution function. Observe that if $M_Y(r)$ exists for some positive $r > 0$, then we can choose $s \in (0, r)$ and Chebychev's inequality gives us (we assume $Y \ge 0$, a.s., here)

$$\mathbb{P}\left[Y > y\right] = \mathbb{P}\left[\exp\{sY\} > \exp\{sy\}\right] \le \exp\{-sy\} M_Y(s). \tag{1.2}$$

The latter tells us that the survival function $1 - F(y) = \mathbb{P}[Y > y]$ decays exponentially for $y \to \infty$. Heavy-tailed distribution functions do not have this property; their survival functions decay more slowly than exponentially as $y \to \infty$. This slower decay of the survival function is the case for so-called subexponential distribution functions (an example is the log-normal distribution, we refer to Rolski et al. [320]) and for regularly varying survival functions (an example is the Pareto distribution). Regularly varying survival functions $1 - F$ have the property

$$\lim_{y \to \infty} \frac{1 - F(ty)}{1 - F(y)} = t^{-\beta} \qquad \text{for all } t > 0 \text{ and some } \beta > 0. \tag{1.3}$$

These distribution functions have a polynomial tail (power tail) with tail index *β >* 0. In particular, if a positively supported distribution function *F* has a regularly varying survival function with tail index *β >* 0, then this distribution function is also subexponential, see Theorem 2.5.5 in Rolski et al. [320].
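For the Pareto distribution the limit in (1.3) is attained exactly, which a few lines of Python make explicit (threshold and tail index values are our illustrative choices): with survival function $1 - F(y) = (\theta/y)^{\beta}$ for $y \ge \theta$, the ratio $(1 - F(ty))/(1 - F(y))$ equals $t^{-\beta}$ for every $y$, not only in the limit.

```python
def pareto_survival(y: float, theta: float, beta: float) -> float:
    """Survival function 1 - F(y) = (theta / y)**beta of a Pareto variable, y >= theta."""
    return (theta / y) ** beta

theta, beta, y = 1000.0, 2.0, 50_000.0
# the regular-variation ratio of (1.3) equals t**(-beta) exactly for the Pareto tail
ratios = {t: pareto_survival(t * y, theta, beta) / pareto_survival(y, theta, beta)
          for t in (2.0, 5.0, 10.0)}
```

With $\beta = 2$, doubling the threshold quarters the exceedance probability ($2^{-2} = 0.25$), a polynomial rather than exponential decay.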

We are not going to specifically focus on heavy-tailed distribution functions here, but we will explain how light-tailed random variables can be transformed to enjoy heavy-tailed properties. In these notes, we are mainly interested in studying different aspects of regression modeling. Regression modeling requires numerous observations to successfully fit these models to the data. By definition, large claims are scarce, as they live in the tail of the distribution function and, thus, correspond to rare events. Therefore, it is often not possible to employ a regression model for scarce tail events. For this reason, extreme value analysis only plays a marginal role in these notes, even though it has a significant impact on insurance prices. For more on extreme value theory we refer to the relevant literature, see, e.g., Embrechts et al. [121], Rolski et al. [320], Mikosch [277] and Albrecher et al. [7].

## **1.3 Lab: Exploratory Data Analysis**

Our theory is going to be supported by several data examples. These examples are mostly based on publicly available data. The different data sets are described in detail in Chap. 13. We highly recommend that the reader use these data sets to gain her/his own modeling experience.

We describe some tools here that allow for a descriptive and exploratory analysis of the available data; exploratory data analysis was introduced and promoted by Tukey [357]. We consider the observed claim sizes of the Swedish motorcycle data set described in Sect. 13.2. This data set consists of 656 (positive) claim amounts $y_i$, $1 \le i \le n = 656$. These claim amounts are illustrated in the boxplots of Fig. 1.1.

Typically in insurance, there are large claims that dominate the picture, see Fig. 1.1 (lhs). This results in right-skewed distribution functions, and such data is better illustrated on the log scale, see Fig. 1.1 (rhs). The latter, of course, assumes that all claims are strictly positive.

Figure 1.2 (lhs) shows the empirical distribution function of the observations $y_i$, $1 \le i \le n$, which is obtained by

$$\widehat{F}_n(y) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{y_i \le y\}} \qquad \text{for } y \in \mathbb{R}.$$
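A minimal Python implementation of the empirical distribution function (function and variable names are ours, for illustration):

```python
from bisect import bisect_right

def empirical_cdf(data):
    """Return y -> (1/n) * #{i : y_i <= y}, the empirical distribution function."""
    sorted_data, n = sorted(data), len(data)
    def F_n(y):
        # number of observations <= y, found in O(log n) on the sorted sample
        return bisect_right(sorted_data, y) / n
    return F_n

F_n = empirical_cdf([3.0, 1.0, 2.0, 2.0])
```

Here `F_n(2.0)` returns 0.75, since three of the four observations do not exceed 2.0.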

If this data set has been generated by i.i.d. random variables, then the Glivenko–Cantelli theorem [64, 159] tells us that this empirical distribution function $\widehat{F}_n$ converges uniformly to the (true) data generating distribution function, a.s., as the number $n$ of observations converges to infinity, see Theorem 20.6 in Billingsley [34].

Figure 1.2 (rhs) shows the empirical density of the observations $y_i$, $1 \le i \le n$. This empirical density is obtained by considering a kernel smoother of a given

**Fig. 1.1** Boxplot of the claim amounts of the Swedish motorcycle data set: (lhs) on the original scale and (rhs) on the log scale

**Fig. 1.2** (lhs) Empirical distribution and (rhs) empirical density of the observed claim amounts *yi*, 1 ≤ *i* ≤ *n*

bandwidth around each observation $y_i$. The standard choice is the Gaussian kernel, with the bandwidth determining the variance parameter $\sigma^2 > 0$ of the Gaussian density,

$$y \mapsto \widehat{f}_n(y) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2} \frac{(y - y_i)^2}{\sigma^2}\right\}.$$
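This kernel density estimator translates directly into Python; the sketch below (data and bandwidth are illustrative choices of ours, not the Swedish motorcycle data) averages Gaussian densities centred at the observations:

```python
from math import exp, pi, sqrt

def kernel_density(data, sigma: float):
    """Gaussian kernel smoother: the average of normal densities centred at each y_i."""
    n, norm = len(data), sqrt(2.0 * pi * sigma ** 2)
    def f_n(y: float) -> float:
        return sum(exp(-0.5 * (y - yi) ** 2 / sigma ** 2) for yi in data) / (n * norm)
    return f_n

f_n = kernel_density([1.0, 2.0, 4.0], sigma=0.5)
# crude Riemann sum check that the estimated density integrates to ~1 over a wide grid
mass = sum(f_n(-5.0 + i * 0.01) * 0.01 for i in range(1500))
```

Being an average of probability densities, $\widehat{f}_n$ again integrates to 1, which the Riemann sum confirms numerically; a larger `sigma` gives a smoother but more biased estimate.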

From the graph in Fig. 1.2 (rhs) we observe that the main body of the claim sizes lies below an amount of 50'000, but the biggest claim exceeds 200'000. The latter motivates studying heavy-tailedness of the claim size data. Therefore, one usually benchmarks against a distribution function $F$ that has a regularly varying survival function with tail index $\beta > 0$, see (1.3). Asymptotically, a regularly varying survival function behaves as $y^{-\beta}$; for this reason the log-log plot is a popular tool to identify regularly varying tails. The log-log plot of a distribution function $F$ is obtained by considering

$$y > 0 \; \mapsto \; \left(\log y, \, \log(1 - F(y))\right) \in \mathbb{R}^2.$$

Figure 1.3 gives the log-log plot of the empirical distribution function $\widehat{F}_n$. If this plot looks asymptotically (for $y \to \infty$) like a straight line with a negative slope $-\beta$, then the data shows heavy-tailedness in the sense of regular variation. Such data cannot be modeled by a distribution function for which the moment generating function $M_Y(r)$ exists for some positive $r > 0$, see (1.2). Figure 1.3 does not suggest a regularly varying tail, as we do not see an obvious asymptotic straight line for increasing claim sizes.
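The log-log diagnostic can be reproduced on simulated data. The sketch below (all parameter choices are ours) simulates from a Pareto distribution, forms the log-log points of the empirical survival function, and estimates the tail slope by least squares on the upper 5% of the points; for a regularly varying tail this slope should come out close to $-\beta$:

```python
import random
from math import log

rng = random.Random(1)
theta, beta, n = 1.0, 2.0, 20_000
# inverse-transform sampling: Y = theta * U**(-1/beta) is Pareto(theta, beta) for U ~ Uniform(0,1]
ys = sorted(theta * (1.0 - rng.random()) ** (-1.0 / beta) for _ in range(n))

# log-log coordinates (log y_(i), log(1 - F_n(y_(i)))) restricted to the empirical tail
pts = [(log(ys[i]), log(1.0 - (i + 1) / (n + 1))) for i in range(n) if (i + 1) / (n + 1) > 0.95]

# least-squares slope through the tail points; regular variation predicts a slope near -beta
mx = sum(x for x, _ in pts) / len(pts)
my = sum(v for _, v in pts) / len(pts)
slope = sum((x - mx) * (v - my) for x, v in pts) / sum((x - mx) ** 2 for x, _ in pts)
```

Running the same recipe on light-tailed (e.g., exponential) data produces a visibly concave log-log plot with no stable straight-line slope, which is exactly the behavior observed for the Swedish motorcycle claims in Fig. 1.3.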

These graphs give us a first indication of what the claim size data is about. Later on we are going to introduce explanatory variables that describe the insurance


**Fig. 1.3** Log-log plot of the empirical distribution function $\widehat{F}_n$

policyholders behind these claims. These explanatory variables characterize the policyholder and the general goal is to get a better description of the claim sizes as a function of these explanatory variables, e.g., older policyholders may cause larger claims than younger ones, etc. Such patterns are called *systematic effects* that can be explained by explanatory variables.

## **1.4 Outline of This Book**

This book has eleven chapters (including the present one), and it has two appendices. We briefly describe the contents of these chapters and appendices.

In Chap. 2 we introduce and discuss the exponential family (EF) and the exponential dispersion family (EDF). The EF and the EDF are by far the most important classes of distribution functions for regression modeling. They include, among others, the Gaussian, the binomial, the Poisson, the gamma, the inverse Gaussian and Tweedie's models. We introduce these families of distribution functions, discuss their properties and provide several examples. Moreover, we introduce the Kullback–Leibler (KL) divergence and the Bregman divergence, which are important tools in model evaluation.

Chapter 3 is on classical statistical decision theory. This chapter is important for historical reasons, but it also provides the right mathematical grounding and intuition for more modern tools from data science and machine learning. In particular, we discuss maximum likelihood estimation (MLE), unbiasedness, consistency and asymptotic normality of MLEs in this chapter.

Chapter 4 is the core theoretical chapter on predictive modeling and forecast evaluation. The main problem in actuarial modeling is to forecast and price future claims. For this, we build predictive models, and this chapter deals with assessing and ranking these predictive models. We therefore introduce the mean squared error of prediction (MSEP) and, more generally, the generalization loss (GL) to assess predictive models. This chapter is complemented by a more decision-theoretic approach to forecast evaluation: it discusses deviance losses, proper scoring, elicitability, forecast dominance, cross-validation and Akaike's information criterion (AIC), and we give an introduction to the bootstrap simulation method.

Chapter 5 discusses the state-of-the-art statistical modeling approach in insurance, which is the generalized linear model (GLM). We discuss GLMs in the light of claim count and claim size modeling; we present feature engineering, model fitting, model selection, over-dispersion, zero-inflated claim count problems, double GLMs, and insurance-specific issues such as the balance property for having unbiasedness.

Chapter 6 summarizes some techniques that use Bayes' theorem. These are classical Bayesian statistical models, e.g., using the Markov chain Monte Carlo (MCMC) method for model fitting. This chapter discusses regularization of regression models such as ridge and LASSO regularization, which has a Bayesian interpretation, and it concerns the Expectation-Maximization (EM) algorithm. The EM algorithm is a general purpose tool that can handle incomplete data settings. We illustrate this for different examples coming from mixture distributions, censored and truncated claims data.

The core of this book is deep learning methods and neural networks. Chapter 7 considers deep feed-forward neural (FN) networks. We introduce the generic architecture of deep FN networks, and we discuss universality theorems of FN networks. We present network fitting, back-propagation, embedding layers for categorical variables and insurance-specific issues such as the balance property in network fitting and network ensembling to reduce model uncertainty. This chapter is complemented by many examples on non-life insurance pricing, but also on mortality modeling, as well as tools that help to explain deep FN network regression results.

Chapters 8 and 9 consider recurrent neural (RN) networks and convolutional neural (CN) networks. These are special network architectures that are useful for time-series and spatial data modeling, e.g., applied to image recognition problems. Time-series and images have a natural topology, and RN and CN networks try to benefit from this additional structure (over tabular data). We introduce these network architectures and provide insurance-relevant examples.

Chapter 10 discusses natural language processing (NLP), which deals with regression modeling of non-tabular or unstructured text data. We explain how words can be embedded into low-dimensional spaces that serve as numerical word encodings. These can then be used for text recognition, either using RN networks or attention layers. We give an example where we aim at predicting claim perils from claim descriptions.

Chapter 11 is a selection of different topics. We mention forecasting under model uncertainty, deep quantile regression, deep composite regression and the LocalGLMnet, which is an interpretable FN network architecture. Moreover, we provide a bootstrap example to assess prediction uncertainty, and we discuss mixture density networks.

Chapter 12 (Appendix A) is a technical chapter that discusses universality theorems for networks and sieve estimators, which are useful for studying asymptotic normality within a network framework. Chapter 13 (Appendix B) illustrates the data used in this book.

Finally, we remark that the book is written in a typical mathematical style using the structure of Lemmas, Theorems, etc. Results and statements which are particularly important for applications are highlighted with gray boxes.


# **Chapter 2 Exponential Dispersion Family**

We introduce the exponential family (EF) and the exponential dispersion family (EDF) in this chapter. The single-parameter EF was introduced in 1934 by the British statistician Sir Ronald Fisher [128], and it was extended to vector-valued parameters by Darmois [88], Koopman [223] and Pitman [306] between 1935 and 1936. It is the most commonly used family of distribution functions in statistical modeling; among others, it contains the Gaussian distribution, the gamma distribution, the binomial distribution and the Poisson distribution. Its parametrization is taken in a special form that is convenient for statistical modeling. The EF can be introduced in a constructive way providing the main properties of this family of distribution functions. In this chapter we follow Jørgensen [201–203] and Barndorff-Nielsen [23], and we state the most important results based on this constructive introduction. This gives us a unified notation which is going to be useful for our purposes.

## **2.1 Exponential Family**

## *2.1.1 Definition and Properties*

We define the EF w.r.t. a *σ*-finite measure *ν* on $\mathbb{R}$. The results in this section can be generalized to *σ*-finite measures on $\mathbb{R}^m$, but such an extension is not necessary for our purposes. Select an integer $k \in \mathbb{N}$, and choose measurable functions $a: \mathbb{R} \to \mathbb{R}$ and $T: \mathbb{R} \to \mathbb{R}^k$.<sup>1</sup> Consider for a *canonical parameter* $\theta \in \mathbb{R}^k$ the Laplace

<sup>1</sup> We could also use boldface notation for $T$ because $T(y) \in \mathbb{R}^k$ is vector-valued, but we prefer to not use boldface notation for (vector-valued) functions.

M. V. Wüthrich, M. Merz, *Statistical Foundations of Actuarial Learning and its Applications*, Springer Actuarial, https://doi.org/10.1007/978-3-031-12409-9\_2

transform

$$\mathfrak{L}(\boldsymbol{\theta}) = \int_{\mathbb{R}} \exp\left\{\boldsymbol{\theta}^{\top} T(y) + a(y)\right\} d\nu(y).$$

Assume that this Laplace transform is not identically equal to +∞. The *effective domain* is defined by

$$\Theta = \left\{ \theta \in \mathbb{R}^k \colon \mathfrak{L}(\theta) < \infty \right\} \subseteq \mathbb{R}^k. \tag{2.1}$$

**Lemma 2.1** *The effective domain $\Theta \subseteq \mathbb{R}^k$ is a convex set.*

The effective domain is not necessarily an open set, but in many applications it is open. Counterexamples are given in Problem 4.1 of Chapter 1 in Lehmann [244], and in the inverse Gaussian example in Sect. 2.1.3, below.

*Proof of Lemma 2.1* Choose $\theta_i \in \mathbb{R}^k$, $i = 1, 2$, with $\mathfrak{L}(\theta_i) < \infty$. Set $\theta = c\theta_1 + (1-c)\theta_2$ for $c \in (0,1)$. We use Hölder's inequality, applied to the norms $p = 1/c$ and $q = 1/(1-c)$,

$$\begin{split} \mathfrak{L}(\boldsymbol{\theta}) &= \int\_{\mathbb{R}} \exp\left\{ \left( c\boldsymbol{\theta}\_{1} + (1-c)\boldsymbol{\theta}\_{2} \right)^{\top} T(\mathbf{y}) + a(\mathbf{y}) \right\} d\boldsymbol{\nu}(\mathbf{y}) \\ &= \int\_{\mathbb{R}} \exp\left\{ \theta\_{1}^{\top} T(\mathbf{y}) + a(\mathbf{y}) \right\}^{c} \exp\left\{ \theta\_{2}^{\top} T(\mathbf{y}) + a(\mathbf{y}) \right\}^{1-c} d\boldsymbol{\nu}(\mathbf{y}) \\ &\leq \mathfrak{L}(\boldsymbol{\theta}\_{1})^{c} \mathfrak{L}(\boldsymbol{\theta}\_{2})^{1-c} < \infty. \end{split}$$

This implies $\theta \in \Theta$ and proves the claim.

We define *the cumulant function* on the effective domain

$$\kappa: \Theta \to \mathbb{R}, \qquad \theta \mapsto \kappa(\theta) = \log \mathfrak{L}(\theta).$$

**Definition 2.2** The EF with *σ*-finite measure *ν* on $\mathbb{R}$ and cumulant function $\kappa: \Theta \to \mathbb{R}$ is given by the distribution functions $F$ on $\mathbb{R}$ with

$$dF(\mathbf{y}; \boldsymbol{\theta}) = f(\mathbf{y}; \boldsymbol{\theta})d\boldsymbol{\nu}(\mathbf{y}) = \left. \exp\left\{ \boldsymbol{\theta}^{\top}T(\mathbf{y}) - \kappa(\boldsymbol{\theta}) + a(\mathbf{y}) \right\} d\boldsymbol{\nu}(\mathbf{y}) \right. \tag{2.2}$$

for canonical parameters $\theta \in \Theta \subseteq \mathbb{R}^k$.


#### *Remarks 2.3*


**Theorem 2.4** *Assume the effective domain has a non-empty interior $\mathring{\Theta}$. Choose $Y \sim F(\cdot\,; \theta)$ for fixed $\theta \in \mathring{\Theta}$. The moment generating function of $T(Y)$ for sufficiently small $r \in \mathbb{R}^k$ is given by*

$$M\_{T(Y)}(\mathbf{r}) = \mathbb{E}\_{\theta} \left[ \exp \left\{ \mathbf{r}^{\top} T(Y) \right\} \right] = \exp \left\{ \kappa \left( \theta + \mathbf{r} \right) - \kappa \left( \theta \right) \right\},$$

*where the expectation operator $\mathbb{E}_{\theta}$ indicates the selected canonical parameter $\theta$ for $Y$.*

*Proof* Choose $\theta \in \mathring{\Theta}$ and $r \in \mathbb{R}^k$ so small that $\theta + r \in \mathring{\Theta}$. We obtain

$$\begin{split} M\_{T(\mathbf{Y})}(\mathbf{r}) &= \int\_{\mathbb{R}} \exp\left\{ (\boldsymbol{\theta} + \mathbf{r})^{\top} T(\mathbf{y}) - \boldsymbol{\kappa}(\boldsymbol{\theta}) + \boldsymbol{a}(\mathbf{y}) \right\} d\boldsymbol{\nu}(\mathbf{y}) \\ &= \exp\left\{ \boldsymbol{\kappa}(\boldsymbol{\theta} + \mathbf{r}) - \boldsymbol{\kappa}(\boldsymbol{\theta}) \right\} \int\_{\mathbb{R}} \exp\left\{ (\boldsymbol{\theta} + \mathbf{r})^{\top} T(\mathbf{y}) - \boldsymbol{\kappa}(\boldsymbol{\theta} + \mathbf{r}) + \boldsymbol{a}(\mathbf{y}) \right\} d\boldsymbol{\nu}(\mathbf{y}) \\ &= \exp\left\{ \boldsymbol{\kappa}(\boldsymbol{\theta} + \mathbf{r}) - \boldsymbol{\kappa}(\boldsymbol{\theta}) \right\}, \end{split}$$

where the last identity follows from the fact that the support of the EF does not depend on the explicit choice of the canonical parameter.

Theorem 2.4 has a couple of immediate implications. First, in any interior point $\theta \in \mathring{\Theta}$ both the moment generating function $r \mapsto M_{T(Y)}(r)$ (in the neighborhood of the origin) and the cumulant function $\theta \mapsto \kappa(\theta)$ have derivatives of all orders, and, similarly to Sect. 1.2, moments of all orders of $T(Y)$ exist, see also (1.1). Existence of moments of all orders implies that the distribution function of $T(Y)$ cannot have regularly varying tails.

**Corollary 2.5** *Assume $\mathring{\Theta}$ is non-empty. The cumulant function $\theta \mapsto \kappa(\theta)$ is convex, and for $Y \sim F(\cdot\,; \theta)$ with $\theta \in \mathring{\Theta}$*

$$
\mu = \mathbb{E}\_{\theta} \left[ T(Y) \right] = \nabla\_{\theta} \kappa(\theta) \qquad \text{and} \qquad \text{Var}\_{\theta} \left( T(Y) \right) = \nabla\_{\theta}^{2} \kappa(\theta),
$$

*where $\nabla_{\theta}$ is the gradient and $\nabla^2_{\theta}$ the Hessian w.r.t. the vector $\theta$.*

Similarly to $T: \mathbb{R} \to \mathbb{R}^k$, we will not use boldface notation for the (multi-dimensional) mean because later on we will understand the mean $\mu = \mu(\theta) \in \mathbb{R}^k$ as a function of the canonical parameter $\theta$; see Footnote 1 on page 13 on boldface notation.

*Proof* Existence of the moment generating function for all sufficiently small $r \in \mathbb{R}^k$ (around the origin) implies that we have first and second moments. For the first moment we obtain

$$\mu = \mathbb{E}\_{\theta} \left[ T(Y) \right] = \left. \nabla\_{\mathbf{r}} M\_{T(Y)}(\mathbf{r}) \right|\_{\mathbf{r} = 0} = \exp \left\{ \kappa (\theta + \mathbf{r}) - \kappa (\theta) \right\} \nabla\_{\mathbf{r}} \kappa (\theta + \mathbf{r})|\_{\mathbf{r} = 0} = \nabla\_{\theta} \kappa (\theta).$$

Denote component $j$ of $T(Y) \in \mathbb{R}^k$ by $T_j(Y)$. We have for $1 \le j, l \le k$

$$\begin{split} \mathbb{E}_{\theta}\left[T_j(Y)\,T_l(Y)\right] &= \left.\frac{\partial^2}{\partial r_j \partial r_l} M_{T(Y)}(r)\right|_{r=0} \\ &= \left.\exp\left\{\kappa(\theta+r) - \kappa(\theta)\right\}\left(\frac{\partial^2}{\partial r_j \partial r_l}\kappa(\theta+r) + \frac{\partial}{\partial r_j}\kappa(\theta+r)\,\frac{\partial}{\partial r_l}\kappa(\theta+r)\right)\right|_{r=0} \\ &= \frac{\partial^2}{\partial \theta_j \partial \theta_l}\kappa(\theta) + \frac{\partial}{\partial \theta_j}\kappa(\theta)\,\frac{\partial}{\partial \theta_l}\kappa(\theta). \end{split}$$

This implies for the covariance

$$\operatorname{Cov}\_{\theta}(T\_{j}(Y), T\_{l}(Y)) = \frac{\partial^{2}}{\partial \theta\_{j} \partial \theta\_{l}} \kappa(\theta).$$

The convexity of $\kappa$ follows because $\nabla^2_{\theta}\kappa(\theta)$ is the positive semi-definite covariance matrix of $T(Y)$, for all $\theta \in \mathring{\Theta}$. This finishes the proof.
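Corollary 2.5 can be checked numerically. The following sketch (Python; the Poisson cumulant function $\kappa(\theta) = e^\theta$ from Sect. 2.1.2 below and the chosen value of $\theta$ are purely illustrative) compares finite differences of $\kappa$ with the mean and variance computed directly from the probability weights.

```python
import math

# Numerical sketch of Corollary 2.5 for the Poisson cumulant function
# kappa(theta) = exp(theta): kappa' gives the mean, kappa'' the variance.
def kappa(theta):
    return math.exp(theta)

theta = 0.3
lam = math.exp(theta)  # Poisson mean lambda = e^theta

# central finite differences approximate kappa'(theta) and kappa''(theta)
h = 1e-4
d1 = (kappa(theta + h) - kappa(theta - h)) / (2 * h)
d2 = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h ** 2

# mean and variance computed directly from the Poisson probability weights
pmf = [math.exp(-lam) * lam ** y / math.factorial(y) for y in range(60)]
mean = sum(y * w for y, w in enumerate(pmf))
var = sum((y - mean) ** 2 * w for y, w in enumerate(pmf))

assert abs(d1 - mean) < 1e-4
assert abs(d2 - var) < 1e-4
```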

**Assumption 2.6 (Minimal Representation)** *We assume that the interior $\mathring{\Theta}$ of the effective domain is non-empty and that the cumulant function $\kappa$ is strictly convex on this interior $\mathring{\Theta}$.*


#### *Remarks 2.7*


**Definition 2.8** The canonical link is defined by $h = (\nabla_{\theta}\kappa)^{-1}$.

The application of the canonical link *h* to the mean implies under Assumption 2.6

$$h\left(\mu\right) = h\left(\mathbb{E}\_{\theta}\left[T(Y)\right]\right) = \theta,$$

for the mean $\mu = \mathbb{E}_{\theta}[T(Y)]$ of $Y \sim F(\cdot\,; \theta)$ with $\theta \in \mathring{\Theta}$.

*Remarks 2.9 (Dual Parameter Space)* Assumption 2.6 provides that the canonical link $h$ is well-defined, and we can either work with the canonical parameter representation $\theta \in \mathring{\Theta} \subseteq \mathbb{R}^k$ or with its dual (mean) parameter representation $\mu = \mathbb{E}_{\theta}[T(Y)] \in \mathcal{M}$ with

$$\mathcal{M} \stackrel{\text{def.}}{=} \nabla\_{\theta} \kappa(\stackrel{\circ}{\Theta}) = \{\nabla\_{\theta} \kappa(\theta); \ \theta \in \stackrel{\circ}{\Theta}\} \subseteq \mathbb{R}^{k}. \tag{2.3}$$

Strict convexity of $\kappa$ implies that there is a one-to-one correspondence between these two parametrizations. $\Theta$ is called the *effective domain* and $\mathcal{M}$ is called the *dual parameter space* or the *mean parameter space*.

In Sect. 2.2.4, below, we introduce one more property called *steepness* that the cumulant function $\kappa$ should satisfy. This additional property gives a relationship between the support $\mathfrak{T}$ of the random variables $T(Y)$ of the given EF and the boundary of the dual parameter space $\mathcal{M}$. This steepness property is important for parameter estimation.

## *2.1.2 Single-Parameter Linear EF: Count Variable Examples*

We start by giving single-parameter discrete linear EF examples based on counting measures on $\mathbb{N}_0$. Since we work in one dimension, $k = 1$, we replace boldface $\boldsymbol{\theta}$ by the scalar $\theta \in \Theta \subseteq \mathbb{R}$ in this section.

#### **Bernoulli Distribution as a Single-Parameter Linear EF**

For the Bernoulli distribution with parameter $p \in (0,1)$ we choose as $\nu$ the counting measure on $\{0,1\}$. We make the following choices: $T(y) = y$,

$$a(y) = 0, \quad \kappa(\theta) = \log(1+e^{\theta}), \quad p = \kappa'(\theta) = \frac{e^{\theta}}{1+e^{\theta}}, \quad \theta = h(p) = \log\left(\frac{p}{1-p}\right), \quad 1-p = \frac{1}{1+e^{\theta}},$$

for effective domain $\Theta = \mathbb{R}$, dual parameter space $\mathcal{M} = (0,1)$ and support $\mathfrak{T} = \{0,1\}$ of $Y = T(Y)$. With these choices we have

$$dF(y;\theta) = \exp\left\{\theta y - \log(1+e^{\theta})\right\} d\nu(y) = \left(\frac{e^{\theta}}{1+e^{\theta}}\right)^{y}\left(\frac{1}{1+e^{\theta}}\right)^{1-y} d\nu(y).$$

$\theta \mapsto \kappa'(\theta)$ is the logistic or sigmoid function, and the canonical link $p \mapsto h(p)$ is the logit function. Mean and variance are given by

$$\mu = \mathbb{E}\_{\theta} \left[ Y \right] = \kappa'(\theta) = p \quad \text{and} \quad \text{Var}\_{\theta} \left( Y \right) = \kappa''(\theta) = \frac{e^{\theta}}{(1 + e^{\theta})^2} = p(1 - p),$$

and the probability weights satisfy for $y \in \mathfrak{T} = \{0,1\}$

$$\mathbb{P}_{\theta}[Y=y] = p^{y}(1-p)^{1-y}.$$
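The Bernoulli identities above admit a quick numerical check; in the following Python sketch the value of $\theta$ is an arbitrary illustration. It verifies $\kappa'(\theta) = p$, $\kappa''(\theta) = p(1-p)$, and that the logit link recovers $\theta$ from $p$.

```python
import math

# Bernoulli EF: kappa(theta) = log(1 + e^theta); check kappa', kappa''
# and the canonical (logit) link, using an illustrative theta.
theta = 0.7
p = math.exp(theta) / (1 + math.exp(theta))

def kappa(t):
    return math.log(1 + math.exp(t))

h = 1e-4
d1 = (kappa(theta + h) - kappa(theta - h)) / (2 * h)          # ~ kappa'(theta)
d2 = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h ** 2  # ~ kappa''

assert abs(d1 - p) < 1e-6                       # mean p
assert abs(d2 - p * (1 - p)) < 1e-4             # variance p(1-p)
assert abs(math.log(p / (1 - p)) - theta) < 1e-12  # logit recovers theta
```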

#### **Binomial Distribution as a Single-Parameter Linear EF**

For the binomial distribution with parameters $n \in \mathbb{N}$ and $p \in (0,1)$ we choose as $\nu$ the counting measure on $\{0,\dots,n\}$. We make the following choices: $T(y) = y$,

$$a(y) = \log\binom{n}{y}, \quad \kappa(\theta) = n\log(1+e^{\theta}), \quad \mu = \kappa'(\theta) = \frac{n e^{\theta}}{1+e^{\theta}}, \quad \theta = h(\mu) = \log\left(\frac{\mu}{n-\mu}\right),$$

for effective domain $\Theta = \mathbb{R}$, dual parameter space $\mathcal{M} = (0,n)$ and support $\mathfrak{T} = \{0,\dots,n\}$ of $Y = T(Y)$. With these choices we have

$$dF(y;\theta) = \binom{n}{y}\exp\left\{\theta y - n\log(1+e^{\theta})\right\} d\nu(y) = \binom{n}{y}\left(\frac{e^{\theta}}{1+e^{\theta}}\right)^{y}\left(\frac{1}{1+e^{\theta}}\right)^{n-y} d\nu(y).$$

Mean and variance are given by

$$\mu = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = np \qquad \text{and} \qquad \operatorname{Var}_{\theta}(Y) = \kappa''(\theta) = n\frac{e^{\theta}}{(1+e^{\theta})^2} = np(1-p),$$

where we set $p = e^{\theta}/(1+e^{\theta})$. The probability weights satisfy for $y \in \mathfrak{T} = \{0,\dots,n\}$

$$\mathbb{P}\_{\theta}[Y=\mathsf{y}] = \binom{n}{\mathsf{y}} p^{\mathsf{y}} (1-p)^{n-\mathsf{y}}.$$
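Analogously to the Bernoulli case, the binomial cumulant function can be verified numerically; the values of $n$ and $\theta$ in this Python sketch are illustrative.

```python
import math

# Binomial EF: kappa(theta) = n log(1 + e^theta) yields mean np and
# variance np(1-p); the probability weights sum to one.
n, theta = 10, -0.3
p = math.exp(theta) / (1 + math.exp(theta))

def kappa(t):
    return n * math.log(1 + math.exp(t))

h = 1e-4
d1 = (kappa(theta + h) - kappa(theta - h)) / (2 * h)
d2 = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h ** 2
assert abs(d1 - n * p) < 1e-6
assert abs(d2 - n * p * (1 - p)) < 1e-3

# normalization of the binomial probability weights
total = sum(math.comb(n, y) * p ** y * (1 - p) ** (n - y) for y in range(n + 1))
assert abs(total - 1.0) < 1e-12
```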

#### **Poisson Distribution as a Single-Parameter Linear EF**

For the Poisson distribution with parameter $\lambda > 0$ we choose as $\nu$ the counting measure on $\mathbb{N}_0$. We make the following choices: $T(y) = y$,

$$a(y) = \log\left(\frac{1}{y!}\right), \quad \kappa(\theta) = e^{\theta}, \quad \mu = \kappa'(\theta) = e^{\theta}, \quad \theta = h(\mu) = \log(\mu),$$

for effective domain $\Theta = \mathbb{R}$, dual parameter space $\mathcal{M} = (0,\infty)$ and support $\mathfrak{T} = \mathbb{N}_0$ of $Y = T(Y)$. With these choices we have

$$dF(\mathbf{y};\theta) = \frac{1}{\mathbf{y}!} \exp\left\{\theta \mathbf{y} - e^{\theta}\right\} d\nu(\mathbf{y}) = e^{-\mu} \frac{\mu^{\mathbf{y}}}{\mathbf{y}!} d\nu(\mathbf{y}).\tag{2.4}$$

The canonical link *μ* → *h(μ)* is the log-link. Mean and variance are given by

$$
\mu = \mathbb{E}\_{\theta} \left[ Y \right] = \kappa'(\theta) = \lambda \qquad \text{and} \qquad \text{Var}\_{\theta} \left( Y \right) = \kappa''(\theta) = \lambda = \mu = \mathbb{E}\_{\theta} \left[ Y \right],
$$

where we set $\lambda = e^{\theta}$. The probability weights in the Poisson case satisfy for $y \in \mathfrak{T} = \mathbb{N}_0$

$$\mathbb{P}\_{\theta}[Y=\mathsf{y}] = e^{-\lambda} \frac{\lambda^{\mathsf{y}}}{\mathsf{y}!}.$$
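The equality of the EF form (2.4) with the classical Poisson probability weights, and the mean–variance identity $\kappa' = \kappa'' = e^\theta$, can be checked numerically; the value of $\lambda$ in this Python sketch is illustrative.

```python
import math

# Poisson EF: the density (2.4) with theta = log(lambda) reproduces the
# classical Poisson probability weights, and mean equals variance.
lam = 2.5
theta = math.log(lam)

for y in range(20):
    ef_form = math.exp(theta * y - math.exp(theta)) / math.factorial(y)
    classical = math.exp(-lam) * lam ** y / math.factorial(y)
    assert abs(ef_form - classical) < 1e-12

pmf = [math.exp(-lam) * lam ** y / math.factorial(y) for y in range(80)]
mean = sum(y * w for y, w in enumerate(pmf))
var = sum((y - mean) ** 2 * w for y, w in enumerate(pmf))
assert abs(mean - lam) < 1e-10
assert abs(var - lam) < 1e-8
```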

#### **Negative-Binomial (Pólya) Distribution as a Single-Parameter Linear EF**

For the negative-binomial distribution with $\alpha > 0$ and $p \in (0,1)$ we choose as $\nu$ the counting measure on $\mathbb{N}_0$; $\alpha$ plays the role of a nuisance parameter or hyperparameter. We make the following choices: $T(y) = y$,

$$a(\mathbf{y}) = \log \binom{\mathbf{y} + \alpha - 1}{\mathbf{y}}, \ \kappa(\theta) = -\alpha \log(1 - e^{\theta}),$$

$$
\mu = \kappa'(\theta) = \alpha \frac{e^{\theta}}{1 - e^{\theta}}, \ \theta = h(\mu) = \log\left(\frac{\mu}{\mu + \alpha}\right),
$$

for effective domain $\Theta = (-\infty, 0)$, dual parameter space $\mathcal{M} = (0,\infty)$ and support $\mathfrak{T} = \mathbb{N}_0$ of $Y = T(Y)$. With these choices we have

$$dF(\mathbf{y};\theta) = \begin{pmatrix} \mathbf{y} + \alpha - 1\\ \mathbf{y} \end{pmatrix} \exp\left\{\theta \mathbf{y} + \alpha \log(1 - e^{\theta})\right\} d\boldsymbol{\nu}(\mathbf{y}),$$

$$= \begin{pmatrix} \mathbf{y} + \alpha - 1\\ \mathbf{y} \end{pmatrix} p^{\mathbf{y}} (1 - p)^{\alpha} d\boldsymbol{\nu}(\mathbf{y}),$$

with $p = e^{\theta}$. The parameter $\alpha > 0$ is treated as a nuisance parameter, otherwise we drop out of the EF framework. We have the first two moments

$$\mu = \mathbb{E}_{\theta}[Y] = \alpha\frac{e^{\theta}}{1-e^{\theta}} = \alpha\frac{p}{1-p} \quad \text{and} \quad \operatorname{Var}_{\theta}(Y) = \mathbb{E}_{\theta}[Y]\left(1 + \frac{e^{\theta}}{1-e^{\theta}}\right) > \mathbb{E}_{\theta}[Y].$$

This model allows us to model over-dispersion, in contrast to the Poisson model. In fact, the negative-binomial model is a mixed Poisson model with a gamma mixing distribution, for details see Sect. 5.3.5, below. Typically, one uses a different parametrization. Set $e^{\theta} = \lambda/(\alpha+\lambda)$, for $\lambda > 0$. This implies

$$
\mu = \mathbb{E}_{\theta}[Y] = \lambda \qquad \text{and} \qquad \operatorname{Var}_{\theta}(Y) = \lambda\left(1 + \frac{\lambda}{\alpha}\right) > \lambda.
$$

For $\alpha \in \mathbb{N}$ this model can also be interpreted as the waiting time until we observe $\alpha$ successful trials among i.i.d. trials; for instance, for $\alpha = 1$ we have the geometric distribution (with a small reparametrization).

The probability weights of the negative-binomial model satisfy for $y \in \mathfrak{T} = \mathbb{N}_0$

$$\mathbb{P}\_{\theta}[Y=y] = \binom{y+\alpha-1}{y} p^y \left(1-p\right)^{\alpha}.\tag{2.5}$$
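The over-dispersion property can be illustrated numerically; the values of $\alpha$ and $\lambda$ in this Python sketch are arbitrary, and the binomial coefficient in (2.5) is evaluated via log-gamma functions since $\alpha$ need not be an integer.

```python
import math

# Negative-binomial EF with e^theta = lambda/(alpha+lambda): the mean is
# lambda, while the variance lambda(1 + lambda/alpha) strictly exceeds it.
alpha, lam = 2.0, 3.0
p = lam / (alpha + lam)  # p = e^theta

def nb_pmf(y):
    # binomial coefficient C(y+alpha-1, y) via lgamma (alpha may be non-integer)
    logc = math.lgamma(y + alpha) - math.lgamma(alpha) - math.lgamma(y + 1)
    return math.exp(logc + y * math.log(p) + alpha * math.log(1 - p))

pmf = [nb_pmf(y) for y in range(500)]
mean = sum(y * w for y, w in enumerate(pmf))
var = sum((y - mean) ** 2 * w for y, w in enumerate(pmf))

assert abs(sum(pmf) - 1.0) < 1e-10
assert abs(mean - lam) < 1e-8
assert abs(var - lam * (1 + lam / alpha)) < 1e-6
assert var > mean  # over-dispersion, in contrast to the Poisson model
```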

## *2.1.3 Vector-Valued Parameter EF: Absolutely Continuous Examples*

In this section we give absolutely continuous EF examples with a vector-valued parameter, $k = 2$, based on the Lebesgue measure on (subsets of) $\mathbb{R}$.

#### **Gaussian Distribution as a Vector-Valued Parameter EF**

For the Gaussian distribution with parameters $\mu \in \mathbb{R}$ and $\sigma^2 > 0$ we choose as $\nu$ the Lebesgue measure on $\mathbb{R}$, and we make the following choices: $T(y) = (y, y^2)^\top$,

$$a(\mathbf{y}) = -\frac{1}{2}\log(2\pi), \qquad \kappa(\theta) = -\frac{\theta\_1^2}{4\theta\_2} - \frac{1}{2}\log(-2\theta\_2),$$

$$(\mu, \sigma^2 + \mu^2)^\top = \nabla\_\theta \kappa(\theta) \ = \left(\frac{\theta\_1}{-2\theta\_2}, (-2\theta\_2)^{-1} + \frac{\theta\_1^2}{4\theta\_2^2}\right)^\top,$$

for effective domain $\Theta = \mathbb{R} \times (-\infty, 0)$, dual parameter space $\mathcal{M} = \mathbb{R} \times (0,\infty)$ and support $\mathfrak{T} = \mathbb{R} \times [0,\infty)$ of $T(Y) = (Y, Y^2)^\top$. With these choices we have

$$dF(\mathbf{y};\boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi}} \exp\left\{\boldsymbol{\theta}^{\top}T(\mathbf{y}) + \frac{\theta\_1^2}{4\theta\_2} + \frac{1}{2}\log(-2\theta\_2)\right\} d\boldsymbol{\nu}(\mathbf{y})$$

$$= \frac{1}{\sqrt{2\pi}(-2\theta\_2)^{-1/2}} \exp\left\{-\frac{1}{2}\frac{1}{(-2\theta\_2)^{-1}} \left(\mathbf{y} - \frac{\theta\_1}{-2\theta\_2}\right)^2\right\} d\boldsymbol{\nu}(\mathbf{y}).$$

This is the Gaussian model with mean $\mu = \theta_1/(-2\theta_2)$ and variance $\sigma^2 = (-2\theta_2)^{-1}$.

If we treat *σ >* 0 as a nuisance parameter, we obtain the Gaussian model as a single-parameter EF. This is the most common example of an EF. Set *T (y)* = *y/σ* and

$$a(y) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{y^2}{2\sigma^2}, \quad \kappa(\theta) = \frac{\theta^2}{2}, \quad \mu = \kappa'(\theta) = \theta, \quad \theta = h(\mu) = \mu,$$

for effective domain $\Theta = \mathbb{R}$, dual parameter space $\mathcal{M} = \mathbb{R}$ and support $\mathfrak{T} = \mathbb{R}$ of $T(Y) = Y/\sigma$. With these choices we have

$$dF(\mathbf{y};\theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{\theta\mathbf{y}/\sigma - \mathbf{y}^2/(2\sigma^2) - \theta^2/2\right\} d\boldsymbol{\nu}(\mathbf{y}),$$

$$= \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{-\frac{1}{2\sigma^2} \left(\mathbf{y} - \sigma\theta\right)^2\right\} d\boldsymbol{\nu}(\mathbf{y}),$$

and, in particular, the canonical link is the *identity link*, $\mu \mapsto \theta = h(\mu) = \mu$, in this single-parameter EF example.
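That the single-parameter EF density with $T(y) = y/\sigma$ coincides with the classical $\mathcal{N}(\sigma\theta, \sigma^2)$ density follows by completing the square in the exponent; a minimal Python sketch (with illustrative values of $\sigma$ and $\theta$) confirms this pointwise.

```python
import math

# Single-parameter Gaussian EF with T(y) = y/sigma versus the classical
# N(sigma*theta, sigma^2) density: the exponents agree after completing
# the square, theta*y/sigma - y^2/(2 sigma^2) - theta^2/2
#   = -(y - sigma*theta)^2 / (2 sigma^2).
sigma, theta = 1.3, 0.4
norm = 1 / (math.sqrt(2 * math.pi) * sigma)

for y in [-2.0, -0.5, 0.0, 1.0, 2.5]:
    ef = norm * math.exp(theta * y / sigma - y ** 2 / (2 * sigma ** 2) - theta ** 2 / 2)
    classical = norm * math.exp(-((y - sigma * theta) ** 2) / (2 * sigma ** 2))
    assert abs(ef - classical) < 1e-12
```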

#### **Gamma Distribution as a Vector-Valued Parameter EF**

For the gamma distribution with parameters $\alpha, \beta > 0$ we choose as $\nu$ the Lebesgue measure on $\mathbb{R}_+$. Then we make the following choices: $T(y) = (y, \log y)^\top$,

$$a(\mathbf{y}) = -\text{log}\mathbf{y}, \qquad \kappa(\boldsymbol{\theta}) = \log \Gamma(\theta\_2) - \theta\_2 \text{log}(-\theta\_1),$$

$$\left(\alpha/\beta, \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} - \log(\beta)\right)^\top = \nabla\_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}) \ = \left(\frac{\theta\_2}{-\theta\_1}, \frac{\Gamma'(\theta\_2)}{\Gamma(\theta\_2)} - \log(-\theta\_1)\right)^\top,$$

for effective domain $\Theta = (-\infty, 0) \times (0,\infty)$, and setting $\beta = -\theta_1 > 0$ and $\alpha = \theta_2 > 0$. The dual parameter space is $\mathcal{M} = (0,\infty) \times \mathbb{R}$, and we have support $\mathfrak{T} = (0,\infty) \times \mathbb{R}$ of $T(Y) = (Y, \log Y)^\top$. With these choices we obtain

$$dF(\mathbf{y};\boldsymbol{\theta}) = \exp\left\{\boldsymbol{\theta}^{\top}T(\mathbf{y}) - \log\Gamma(\theta\_2) + \theta\_2\log(-\theta\_1) - \log\mathbf{y}\right\}d\boldsymbol{\nu}(\mathbf{y}),$$

$$= \frac{(-\theta\_1)^{\theta\_2}}{\Gamma(\theta\_2)}\mathbf{y}^{\theta\_2 - 1}\exp\left\{-(-\theta\_1)\mathbf{y}\right\}d\boldsymbol{\nu}(\mathbf{y})$$

$$= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,y^{\alpha-1}\exp\left\{-\beta y\right\} d\nu(y).$$

This is a vector-valued parameter EF with *k* = 2, and the first moment is given by

$$\mathbb{E}\_{\theta}\left[\left(Y,\log Y\right)^{\top}\right] = \nabla\_{\theta} \kappa(\theta) = \left(\alpha/\beta, \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} - \log(\beta)\right)^{\top}.$$

Parameter $\alpha$ is called *shape parameter* and parameter $\beta$ is called *scale parameter*.<sup>2</sup>

If we treat the shape parameter *α >* 0 as a nuisance parameter we can turn the gamma distribution into a single-parameter linear EF. Set *T (y)* = *y* and

$$a(y) = (\alpha-1)\log y - \log\Gamma(\alpha), \quad \kappa(\theta) = -\alpha\log(-\theta), \quad \mu = \kappa'(\theta) = \frac{\alpha}{-\theta}, \quad \theta = h(\mu) = -\frac{\alpha}{\mu},$$

for effective domain $\Theta = (-\infty, 0)$, dual parameter space $\mathcal{M} = (0,\infty)$ and support $\mathfrak{T} = (0,\infty)$. With these choices we have for $\beta = -\theta > 0$

$$dF(\mathbf{y};\theta) = \frac{(-\theta)^{\alpha}}{\Gamma(\alpha)} \mathbf{y}^{\alpha - 1} \exp\left\{-(-\theta)\mathbf{y}\right\} d\nu(\mathbf{y}).\tag{2.6}$$

This provides us with mean and variance

$$
\mu = \mathbb{E}\_{\theta}[Y] = \frac{\alpha}{\beta} \qquad \text{and} \qquad \sigma^2 = \text{Var}\_{\theta}(Y) = \frac{\alpha}{\beta^2} = \frac{1}{\alpha}\mu^2.
$$

<sup>2</sup> The function $\psi(x) = \frac{d}{dx}\log\Gamma(x) = \Gamma'(x)/\Gamma(x)$ is called the digamma function.

For parameter estimation one often needs to invert these identities, which gives us

$$
\alpha = \frac{\mu^2}{\sigma^2} \qquad \text{and} \qquad \beta = \frac{\mu}{\sigma^2}.
$$
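The moment identities and their inversion form a round trip, which a short Python sketch (with arbitrary illustrative values of $\alpha$ and $\beta$) makes explicit.

```python
import math

# Gamma model round trip: from (alpha, beta) to (mu, sigma^2) via
# mu = alpha/beta, sigma^2 = alpha/beta^2, and back via the inversion
# alpha = mu^2/sigma^2, beta = mu/sigma^2.
alpha, beta = 2.5, 0.8
mu = alpha / beta
sigma2 = alpha / beta ** 2

assert abs(sigma2 - mu ** 2 / alpha) < 1e-12   # sigma^2 = mu^2 / alpha
assert abs(mu ** 2 / sigma2 - alpha) < 1e-12   # recover the shape parameter
assert abs(mu / sigma2 - beta) < 1e-12         # recover the scale parameter
```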

#### *Remarks 2.10*


$$dF(\mathbf{y}; \vartheta) = \frac{\mathbf{y}^{\alpha - 1}}{\Gamma(\alpha)} \exp\left\{-e^{-\vartheta}\mathbf{y} - \alpha \vartheta\right\} d\nu(\mathbf{y}).\tag{2.7}$$

We will study the gamma model in more depth below, and parametrization (2.7) will correspond to the log-link choice, see Example 5.5, below.

Figure 2.1 gives examples of gamma densities for shape parameters $\alpha \in \{1/2, 1, 3/2, 2\}$ and scale parameters $\beta \in \{1/2, 1, 3/2, 2\}$ with $\alpha = \beta$, all providing the same mean $\mu = \mathbb{E}_{\theta}[Y] = \alpha/\beta = 1$. The crucial observation is that these gamma densities can have two different shapes: for $\alpha \le 1$ we have a strictly decreasing density, and for $\alpha > 1$ we have a unimodal density with mode at $(\alpha-1)/\beta$.

#### **Inverse Gaussian Distribution as a Vector-Valued Parameter EF**

For the inverse Gaussian distribution with parameters $\alpha, \beta > 0$ we choose as $\nu$ the Lebesgue measure on $\mathbb{R}_+$. Then we make the following choices: $T(y) = (y, 1/y)^\top$,

$$a(y) = -\frac{1}{2}\log(2\pi y^3), \quad \kappa(\theta) = -2(\theta_1\theta_2)^{1/2} - \frac{1}{2}\log(-2\theta_2),$$

$$\left(\alpha/\beta, \beta/\alpha + 1/\alpha^2\right)^\top = \nabla\_\theta\kappa(\boldsymbol{\theta}) \ = \left(\left(\frac{-2\theta\_2}{-2\theta\_1}\right)^{1/2}, \left(\frac{-2\theta\_1}{-2\theta\_2}\right)^{1/2} + \frac{1}{-2\theta\_2}\right)^\top,$$

for $\theta = (\theta_1, \theta_2)^\top \in (-\infty, 0)^2$, and setting $\beta = (-2\theta_1)^{1/2}$ and $\alpha = (-2\theta_2)^{1/2}$. The dual parameter space is $\mathcal{M} = (0,\infty)^2$, and we have support $\mathfrak{T} = (0,\infty)^2$ of $T(Y) = (Y, 1/Y)^\top$. With these choices we obtain

$$dF(\mathbf{y};\boldsymbol{\theta}) = \exp\left\{\theta^{\top}T(\mathbf{y}) + 2(\theta\_{1}\theta\_{2})^{1/2} + \frac{1}{2}\log(-2\theta\_{2}) - \frac{1}{2}\log(2\pi\mathbf{y}^{3})\right\}d\boldsymbol{\nu}(\mathbf{y})$$

$$= \frac{1}{(2\pi\mathbf{y}^{3})^{1/2}}(-2\theta\_{2})^{1/2}\exp\left\{-\frac{1}{2\mathbf{y}}\left((-2\theta\_{1})\mathbf{y}^{2} + (-2\theta\_{2}) - 4(\theta\_{1}\theta\_{2})^{1/2}\mathbf{y}\right)\right\}d\boldsymbol{\nu}(\mathbf{y})$$

$$= \frac{\alpha}{(2\pi y^{3})^{1/2}}\exp\left\{-\frac{\alpha^{2}}{2y}\left(1 - \frac{\beta}{\alpha}y\right)^{2}\right\} d\nu(y).\tag{2.8}$$

This is a vector-valued parameter EF with *k* = 2 and with first moment

$$\mathbb{E}\_{\theta} \left[ \left( Y, 1/Y \right)^{\top} \right] = \nabla\_{\theta} \kappa \left( \theta \right) = \left( \alpha/\beta, \beta/\alpha + 1/\alpha^2 \right)^{\top}.$$

To obtain (2.8) we have chosen canonical parameter $\theta = (\theta_1, \theta_2)^\top \in (-\infty, 0)^2$. Interestingly, we can close this parameter space at $\theta_1 = 0$, i.e., the effective domain is not open in this example. The choice $\theta_1 = 0$ gives us the cumulant function $\kappa(\theta) = -\frac{1}{2}\log(-2\theta_2)$ and the boundary case

$$dF(\mathbf{y};\boldsymbol{\theta}) = \exp\left\{\boldsymbol{\theta}^{\top}T(\mathbf{y}) + \frac{1}{2}\log(-2\theta\_2) - \frac{1}{2}\log(2\pi\mathbf{y}^3)\right\}d\boldsymbol{\nu}(\mathbf{y})$$

$$= \frac{1}{(2\pi\mathbf{y}^3)^{1/2}}(-2\theta\_2)^{1/2}\exp\left\{-\frac{-2\theta\_2}{2\mathbf{y}}\right\}d\boldsymbol{\nu}(\mathbf{y})$$

$$= \frac{\alpha}{(2\pi\mathbf{y}^3)^{1/2}}\exp\left\{-\frac{\alpha^2}{2\mathbf{y}}\right\}d\boldsymbol{\nu}(\mathbf{y}).\tag{2.9}$$


This is the distribution of the first-passage time of level $\alpha > 0$ of a standard Brownian motion, see Bachelier [20]; this distribution is also known as the Lévy distribution.

If we treat *α >* 0 as a nuisance parameter, we can turn the inverse Gaussian distribution into a single-parameter linear EF by setting *T (y)* = *y*,

$$a(\mathbf{y}) = \log\left(\frac{\alpha}{(2\pi \mathbf{y}^3)^{1/2}}\right) - \frac{\alpha^2}{2\mathbf{y}}, \ \kappa(\theta) = -\alpha(-2\theta)^{1/2},$$

$$\mu = \kappa'(\theta) = \frac{\alpha}{(-2\theta)^{1/2}}, \ \theta = h(\mu) = -\frac{1}{2}\frac{\alpha^2}{\mu^2},$$

for $\theta \in (-\infty, 0)$, dual parameter space $\mathcal{M} = (0,\infty)$ and support $\mathfrak{T} = (0,\infty)$. With these choices we have the inverse Gaussian model for $\beta = (-2\theta)^{1/2} > 0$

$$dF(y;\theta) = \exp\{a(y)\}\exp\left\{-\frac{1}{2y}\left((-2\theta)y^2 - 2\alpha(-2\theta)^{1/2}y\right)\right\} d\nu(y)$$

$$= \frac{\alpha}{(2\pi\mathbf{y}^3)^{1/2}} \exp\left\{-\frac{\alpha^2}{2\mathbf{y}}\left(1 - \frac{\beta}{\alpha}\mathbf{y}\right)^2\right\} d\nu(\mathbf{y}).$$

This provides us with mean and variance

$$\mu = \mathbb{E}\_{\theta}[Y] = \frac{\alpha}{\beta} \qquad \text{and} \qquad \sigma^2 = \text{Var}\_{\theta}(Y) = \frac{\alpha}{\beta^3} = \frac{1}{\alpha^2} \mu^3.$$

For parameter estimation one often needs to invert these identities, which gives us

$$
\alpha = \frac{\mu^{3/2}}{\sigma} \qquad \text{and} \qquad \beta = \frac{\mu^{1/2}}{\sigma}.
$$
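As in the gamma case, the inverse Gaussian moment identities and their inversion form a round trip; the following Python sketch (with arbitrary illustrative values of $\alpha$ and $\beta$) verifies it.

```python
import math

# Inverse Gaussian round trip: mu = alpha/beta, sigma^2 = alpha/beta^3
# = mu^3/alpha^2, and the inversion alpha = mu^{3/2}/sigma,
# beta = mu^{1/2}/sigma.
alpha, beta = 1.5, 2.0
mu = alpha / beta
sigma2 = alpha / beta ** 3
sigma = math.sqrt(sigma2)

assert abs(sigma2 - mu ** 3 / alpha ** 2) < 1e-12  # sigma^2 = mu^3/alpha^2
assert abs(mu ** 1.5 / sigma - alpha) < 1e-10      # recover alpha
assert abs(mu ** 0.5 / sigma - beta) < 1e-10       # recover beta
```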

Figure 2.2 gives examples of inverse Gaussian densities for parameter choices $\alpha = \beta \in \{1/2, 1, 3/2, 2\}$, all providing the same mean $\mu = \mathbb{E}_{\theta}[Y] = \alpha/\beta = 1$.

#### **Generalized Inverse Gaussian Distribution as a Vector-Valued Parameter EF**

For the generalized inverse Gaussian distribution with parameters $\alpha, \beta > 0$ and $\gamma \in \mathbb{R}$ we choose as $\nu$ the Lebesgue measure on $\mathbb{R}_+$. We combine the terms of the gamma and the inverse Gaussian models to the vector-valued choice $T(y) = (y, \log y, 1/y)^\top$ with $k = 3$. Moreover, we choose $a(y) = -\log y$ and cumulant function

$$\kappa(\theta) = \log\left(2K\_{\theta\_2}(2\sqrt{\theta\_1\theta\_3})\right) - \frac{\theta\_2}{2}\log(\theta\_1/\theta\_3),$$

**Fig. 2.2** Inverse Gaussian densities for parameters *α* = *β* ∈ {1*/*2*,* 1*,* 3*/*2*,* 2} all providing the same mean *μ* = *α/β* = 1

for $\theta = (\theta_1, \theta_2, \theta_3)^\top \in (-\infty, 0) \times \mathbb{R} \times (-\infty, 0)$, and where $K_{\theta_2}$ denotes the modified Bessel function of the second kind with index $\gamma = \theta_2 \in \mathbb{R}$. With these choices we obtain the generalized inverse Gaussian density

$$dF(y;\boldsymbol{\theta}) = \exp\left\{\boldsymbol{\theta}^{\top}T(y) - \log\left(2K_{\theta_2}\left(2\sqrt{\theta_1\theta_3}\right)\right) + \frac{\theta_2}{2}\log(\theta_1/\theta_3) - \log y\right\}d\nu(y)$$

$$= \frac{(\alpha/\beta)^{\gamma/2}}{2K_{\gamma}(\sqrt{\alpha\beta})}\,y^{\gamma-1}\exp\left\{-\frac{1}{2}\left(\alpha y + \beta y^{-1}\right)\right\}d\nu(y),\tag{2.10}$$

setting $\alpha = -2\theta_1$ and $\beta = -2\theta_3$. This is a vector-valued parameter EF with $k = 3$, and the first moment is given by

$$\mathbb{E}_{\boldsymbol{\theta}}\left[\left(Y, \log Y, \frac{1}{Y}\right)^{\top}\right] = \nabla_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta})$$

$$= \left(\frac{K_{\gamma+1}(\sqrt{\alpha\beta})}{K_{\gamma}(\sqrt{\alpha\beta})}\sqrt{\frac{\beta}{\alpha}},\; \log\sqrt{\frac{\beta}{\alpha}} + \frac{\partial}{\partial\gamma}\log K_{\gamma}(\sqrt{\alpha\beta}),\; \frac{K_{\gamma+1}(\sqrt{\alpha\beta})}{K_{\gamma}(\sqrt{\alpha\beta})}\sqrt{\frac{\alpha}{\beta}} - \frac{2\gamma}{\beta}\right)^{\top}.$$

The effective domain $\boldsymbol{\Theta}$ is a bit complicated because the possible choices of $(\theta_1, \theta_3)$ depend on $\theta_2 \in \mathbb{R}$: namely, for $\theta_2 < 0$ the negative half-line $(-\infty, 0]$ can be closed at the origin for $\theta_1$, and for $\theta_2 > 0$ it can be closed at the origin for $\theta_3$. The inverse Gaussian model is obtained for $\theta_2 = -1/2$ and the gamma model is obtained for $\theta_3 = 0$. For further properties of the generalized inverse Gaussian distribution we refer to the textbook of Jørgensen [200].
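The Bessel-function moment formulas can be illustrated numerically. The sketch below (hypothetical parameters $\alpha = 2$, $\beta = 3$, $\gamma = 0.7$) evaluates $K_\nu$ through its integral representation $K_\nu(x) = \int_0^\infty e^{-x\cosh t}\cosh(\nu t)\,dt$, so no special-function library is needed, and compares the numerical mean of density (2.10) with the Bessel ratio above.

```python
import numpy as np

def trapezoid(fx, x):
    # simple trapezoidal rule
    return float(np.sum(0.5 * (fx[1:] + fx[:-1]) * np.diff(x)))

def bessel_k(nu, x, t_max=30.0, n=200_000):
    # modified Bessel function of the second kind via the integral
    # representation K_nu(x) = int_0^inf exp(-x*cosh(t)) * cosh(nu*t) dt
    t = np.linspace(0.0, t_max, n)
    return trapezoid(np.exp(-x * np.cosh(t)) * np.cosh(nu * t), t)

alpha, beta, gam = 2.0, 3.0, 0.7        # hypothetical GIG parameters
s = np.sqrt(alpha * beta)

# generalized inverse Gaussian density (2.10)
y = np.linspace(1e-4, 100.0, 1_000_000)
f = (alpha / beta)**(gam / 2) / (2 * bessel_k(gam, s)) * \
    y**(gam - 1) * np.exp(-0.5 * (alpha * y + beta / y))

mass = trapezoid(f, y)                  # ~1: the cumulant function normalizes the density
mean_num = trapezoid(y * f, y)          # numerical E[Y]
mean_bessel = np.sqrt(beta / alpha) * bessel_k(gam + 1, s) / bessel_k(gam, s)

print(mass, mean_num, mean_bessel)
```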

## *2.1.4 Vector-Valued Parameter EF: Count Variable Example*

We close our EF examples by giving a discrete example with a vector-valued parameter.

#### **Categorical Distribution as a Vector-Valued Parameter EF**

For the categorical distribution with $k \in \mathbb{N}$ and $p \in (0,1)^k$ such that $\sum_{i=1}^k p_i < 1$, we choose as $\nu$ the counting measure on the finite set $\{1,\ldots,k+1\}$. Then we make the following choices: $T(y) = (\mathbb{1}_{\{y=1\}},\ldots,\mathbb{1}_{\{y=k\}})^{\top} \in \mathbb{R}^k$, $\boldsymbol{\theta} = (\theta_1,\ldots,\theta_k)^{\top}$, $e^{\boldsymbol{\theta}} = (e^{\theta_1},\ldots,e^{\theta_k})^{\top}$ and

$$a(y) = 0, \qquad \kappa(\boldsymbol{\theta}) = \log\left(1 + \sum_{l=1}^{k}e^{\theta_l}\right), \qquad p = \nabla_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta}) = \frac{e^{\boldsymbol{\theta}}}{1 + \sum_{l=1}^{k}e^{\theta_l}},$$

for effective domain $\boldsymbol{\Theta} = \mathbb{R}^k$ and dual parameter space $\mathcal{M} = (0,1)^k$; the support $\mathfrak{T}$ of $T(Y)$ consists of the $k+1$ corners of the unit simplex in $\mathbb{R}^k$. This representation is minimal, see Assumption 2.6. With these choices we have (set $\theta_{k+1} = 0$)

$$dF(y;\boldsymbol{\theta}) = \exp\left\{\boldsymbol{\theta}^{\top}T(y) - \log\left(1 + \sum_{l=1}^{k}e^{\theta_l}\right)\right\}d\nu(y) = \prod_{j=1}^{k+1}\left(\frac{e^{\theta_j}}{\sum_{l=1}^{k+1}e^{\theta_l}}\right)^{\mathbb{1}_{\{y=j\}}}d\nu(y).$$

This is a vector-valued parameter EF with $k \in \mathbb{N}$. The canonical link is slightly more complicated. Set the vectors $v = \exp\{\boldsymbol{\theta}\} \in \mathbb{R}^k$ and $w = (1,\ldots,1)^{\top} \in \mathbb{R}^k$. This provides $p = \nabla_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta}) = \frac{1}{1+w^{\top}v}\,v \in \mathbb{R}^k$. Set the matrix $A_p = \mathbb{1} - pw^{\top} \in \mathbb{R}^{k\times k}$, where $\mathbb{1}$ denotes the identity matrix; this gives us $p = A_p v$, and since $A_p$ has full rank $k$, we obtain the canonical link

$$p \mapsto \boldsymbol{\theta} = h(p) = \log\left(A_p^{-1}p\right) = \log\left(\frac{p}{1 - w^{\top}p}\right).$$

The last identity can be verified by explicit calculation

$$\log\left(\frac{p}{1 - w^{\top}p}\right) = \log\left(\frac{e^{\boldsymbol{\theta}}/(1 + \sum_{j=1}^{k}e^{\theta_j})}{1 - \sum_{l=1}^{k}e^{\theta_l}/(1 + \sum_{j=1}^{k}e^{\theta_j})}\right) = \log\left(e^{\boldsymbol{\theta}}\right) = \boldsymbol{\theta}.$$
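This round trip between the canonical parameter $\boldsymbol{\theta}$ and the mean parameter $p$ is easy to check numerically; a minimal sketch with an arbitrary (hypothetical) $\boldsymbol{\theta} \in \mathbb{R}^4$:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
theta = rng.normal(size=k)                  # canonical parameter in R^k (hypothetical)

# mean map p = grad kappa(theta) = e^theta / (1 + sum_l e^{theta_l})
p = np.exp(theta) / (1.0 + np.exp(theta).sum())

# canonical link h(p) = log(p / (1 - w^T p)) with w = (1, ..., 1)
theta_back = np.log(p / (1.0 - p.sum()))

# A_p = 1 - p w^T (1 = identity matrix) satisfies p = A_p v for v = exp(theta)
A_p = np.eye(k) - np.outer(p, np.ones(k))
v = np.exp(theta)

print(np.max(np.abs(theta_back - theta)))   # ~0: h inverts the mean map
print(np.max(np.abs(A_p @ v - p)))          # ~0: p = A_p v
```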

*Remarks 2.11*

• There are many more examples that belong to the EF. From Theorem 2.4, we know that all examples of the EF are light-tailed in the sense that all moments of $T(Y)$ exist. If we want to model heavy-tailed distributions within the EF, we first need to apply a suitable transformation. We could model the Pareto distribution using the transformation $T(y) = \log y$ and assuming that the transformed random variable has an exponential distribution. Different light-tailed examples are obtained by, e.g., using the transformation $T(y) = y^{\tau}$ for the Weibull distribution or $T(y) = (\log y, \log(1-y))$ for the beta distribution. We refrain from giving explicit formulas for these or other examples.

• Observe that in all examples above we have $\mathfrak{T} \subset \overline{\mathcal{M}}$, i.e., the support of $T(Y)$ is contained in the closure of the dual parameter space $\mathcal{M}$; we come back to this observation in Sect. 2.2.4, below.

## **2.2 Exponential Dispersion Family**

In the previous section we have introduced the EF, and we have explicitly studied the vector-valued parameter EF examples of the Gaussian, the gamma and the inverse Gaussian models. We have highlighted that these three vector-valued parameter EFs can be turned into single-parameter EFs by declaring one parameter to be a nuisance parameter that is not modeled (and acts as a hyper-parameter). These three single-parameter EFs with nuisance parameter can also be interpreted as EDF models. In this section we discuss the single-parameter EDF; this is sufficient for our purposes, and vector-valued parameter extensions can be obtained in a canonical way.

## *2.2.1 Definition and Properties*

The EFs of Sect. 2.1 can be extended to EDFs. In the single-parameter case this is achieved by a transformation $Y = X/\omega$, where $\omega > 0$ is a scaling and where $X$ belongs to a single-parameter linear EF, i.e., with $T(x) = x$. We restrict ourselves to the single-parameter case $k = 1$ throughout this section. Choose a $\sigma$-finite measure $\nu_1$ on $\mathbb{R}$ and a measurable function $a_1 : \mathbb{R} \to \mathbb{R}$. These choices give a single-parameter linear EF, directly modeling a real-valued random variable $T(X) = X$. By (2.2) we have the distribution of the single-parameter linear EF random variable $X$

$$dF(x;\theta,1) = f(x;\theta,1)\,d\nu_1(x) = \exp\left\{\theta x - \kappa(\theta) + a_1(x)\right\}d\nu_1(x),$$

on the effective domain

$$\Theta = \left\{\theta \in \mathbb{R};\; \int_{\mathbb{R}}\exp\left\{\theta x + a_1(x)\right\}d\nu_1(x) < \infty\right\},\tag{2.11}$$

and with cumulant function

$$\theta \in \Theta \mapsto \kappa(\theta) = \log\left(\int_{\mathbb{R}}\exp\left\{\theta x + a_1(x)\right\}d\nu_1(x)\right).\tag{2.12}$$

Throughout, we assume that the effective domain $\Theta$ has a non-empty interior $\mathring{\Theta}$. Thus, since $\Theta$ is convex, we assume that $\mathring{\Theta}$ is a non-empty (possibly infinite) open interval in $\mathbb{R}$.

Following Jørgensen [201, 202], we extend this linear EF to an EDF as follows. Choose a family of $\sigma$-finite measures $\nu_\omega$ on $\mathbb{R}$ and measurable functions $a_\omega : \mathbb{R} \to \mathbb{R}$ for indices $\omega$ in a given index set $\mathcal{W}$ with $\{1\} \subset \mathcal{W} \subset \mathbb{R}_+$. Assume that we have an $\omega$-independent scaled cumulant function $\kappa$ on this index set $\mathcal{W}$, that is,

$$\theta \in \Theta \mapsto \kappa(\theta) = \frac{1}{\omega}\left(\log \int_{\mathbb{R}}\exp\{\theta x + a_\omega(x)\}\,d\nu_\omega(x)\right) \qquad \text{for all } \omega \in \mathcal{W},$$

with effective domain defined by (2.11), i.e., for *ω* = 1. This allows us to consider the distribution functions

$$dF(x;\theta,\omega) = f(x;\theta,\omega)\,d\nu_\omega(x) = \exp\left\{\theta x - \omega\kappa(\theta) + a_\omega(x)\right\}d\nu_\omega(x)$$

$$= \exp\left\{\omega\left(\theta y - \kappa(\theta)\right) + a_\omega(\omega y)\right\}d\nu_\omega(\omega y),\tag{2.13}$$

in the third identity we did a change of variable *x* → *y* = *x/ω*. By reparametrizing the function *aω(ω* ·*)* and the *σ*-finite measures *νω(ω* ·*)* slightly differently, depending on the particular structure of the chosen *σ*-finite measures, we arrive at the following single-parameter EDF.

**Definition 2.12** The (single-parameter) EDF is given by densities of the form

$$Y \sim f(y;\theta, v/\varphi) = \exp\left\{\frac{y\theta - \kappa(\theta)}{\varphi/v} + a(y; v/\varphi)\right\},\tag{2.14}$$

with

$\kappa : \Theta \to \mathbb{R}$ is the cumulant function (2.12),

$\theta \in \Theta$ is the canonical parameter in the effective domain (2.11),

$v > 0$ is a given weight (exposure, volume),

$\varphi > 0$ is the dispersion parameter, and

$a(\cdot\,; v/\varphi)$ is the normalization, not depending on the canonical parameter $\theta$.

## *Remarks 2.13*


**Corollary 2.14** *Assume $\mathring{\Theta}$ is non-empty and that $\nu_1$ is not concentrated in one single point. Choose $Y \sim F(\cdot\,;\theta, v/\varphi)$ for fixed $\theta \in \mathring{\Theta}$. The moment generating function of $Y$ for small $r \in \mathbb{R}$ satisfies*

$$M_Y(r) = \mathbb{E}_{\theta}\left[\exp\{rY\}\right] = \exp\left\{\frac{v}{\varphi}\left[\kappa(\theta + r\varphi/v) - \kappa(\theta)\right]\right\}.$$

*The first two moments of Y are given by*

$$\mu = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) \qquad \text{and} \qquad \text{Var}_{\theta}(Y) = \frac{\varphi}{v}\kappa''(\theta) > 0.$$

*The cumulant function $\kappa$ is smooth and strictly convex on $\mathring{\Theta}$ with canonical link $h = (\kappa')^{-1}$. The variance function is defined by $\mu \mapsto V(\mu) = (\kappa'' \circ h)(\mu)$ and, consequently, for the variance of $Y$ we have $\text{Var}_{\mu}(Y) = \frac{\varphi}{v}V(\mu)$ for $\mu \in \mathcal{M}$.*

*Proof* This follows analogously to Theorem 2.4. The linear case $T(y) = y$ with $\nu_1$ not being concentrated in one single point guarantees that the minimal dimension is $k = 1$, providing a minimal representation in this dimension, see Assumption 2.6.

Before giving explicit examples we state the so-called convolution formula.

**Corollary 2.15 (Convolution Formula)** *Assume $\mathring{\Theta}$ is non-empty and that $\nu_1$ is not concentrated in one single point. Assume that $Y_i \sim F(\cdot\,;\theta, v_i/\varphi)$ are independent, for $1 \le i \le n$, with fixed $\theta \in \mathring{\Theta}$. Set $v_+ = \sum_{i=1}^n v_i$. Then*

$$Y_+ = \frac{1}{v_+}\sum_{i=1}^n v_i Y_i \sim F(\cdot\,;\theta, v_+/\varphi).$$

*Proof* The proof immediately follows from calculating the moment generating function $M_{Y_+}(r)$ and from using the independence between the $Y_i$'s.
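The convolution formula can be illustrated by Monte Carlo simulation in the gamma case of Sect. 2.2.2, where $Y_i \sim F(\cdot\,;\theta, v_i/\varphi)$ is a gamma distribution with shape $v_i/\varphi$ and rate $(v_i/\varphi)\beta$ for $\beta = -\theta$; all numerical values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
beta, phi = 2.0, 0.5                 # rate beta = -theta and dispersion phi
v = np.array([1.0, 3.0, 0.5])        # individual volumes v_i
v_plus = v.sum()
n_sim = 400_000

# reproductive gamma EDF: Y_i ~ Gamma(shape v_i/phi, rate (v_i/phi)*beta), mean 1/beta
Y = rng.gamma(shape=v / phi, scale=phi / (v * beta), size=(n_sim, len(v)))
Y_plus = (Y * v).sum(axis=1) / v_plus

# convolution formula: Y_+ ~ F(.; theta, v_+/phi) = Gamma(v_+/phi, (v_+/phi)*beta)
print(Y_plus.mean())                 # ~ 1/beta
print(Y_plus.var())                  # ~ (phi/v_+) * (1/beta)^2
```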

## *2.2.2 Exponential Dispersion Family Examples*

The single-parameter linear EF examples introduced above can be reformulated as EDF examples.

#### **Binomial Distribution as a Single-Parameter EDF**

For the binomial distribution with parameters $p \in (0,1)$ and $n \in \mathbb{N}$ we choose the counting measure on $\{0, 1/n, \ldots, 1\}$ with $\omega = n$. Then we make the following choices

$$a(y) = \log\binom{n}{ny}, \quad \kappa(\theta) = \log(1+e^{\theta}), \quad p = \kappa'(\theta) = \frac{e^{\theta}}{1+e^{\theta}}, \quad \theta = h(p) = \log\left(\frac{p}{1-p}\right),$$

for effective domain $\Theta = \mathbb{R}$ and dual parameter space $\mathcal{M} = (0,1)$. With these choices we have

$$f(y;\theta,n) = \binom{n}{ny}\exp\left\{n\left(\theta y - \log(1+e^{\theta})\right)\right\} = \binom{n}{ny}\left(\frac{e^{\theta}}{1+e^{\theta}}\right)^{ny}\left(\frac{1}{1+e^{\theta}}\right)^{n-ny}.$$

This is a single-parameter EDF. The canonical link *p* → *h(p)* gives the logit function. Mean and variance are given by

$$p = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = \frac{e^{\theta}}{1+e^{\theta}} \quad \text{and} \quad \text{Var}_{\theta}(Y) = \frac{1}{n}\kappa''(\theta) = \frac{1}{n}\frac{e^{\theta}}{(1+e^{\theta})^2} = \frac{1}{n}p(1-p),$$

and the variance function is given by *V (μ)* = *μ(*1 − *μ)*. The binomial random variable is obtained by setting *X* = *nY* ∼ Binom*(n, p)*.
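A quick numerical sanity check (hypothetical $n$ and $p$) compares the mean map and the variance function with simulated binomial data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 0.3                                         # hypothetical exposure and success probability
theta = np.log(p / (1 - p))                            # canonical link: logit

mean = np.exp(theta) / (1 + np.exp(theta))             # kappa'(theta) = p
variance = np.exp(theta) / (1 + np.exp(theta))**2 / n  # kappa''(theta)/n = p(1-p)/n

X = rng.binomial(n, p, size=500_000)
Y = X / n                                              # EDF variable Y = X/n

print(mean, variance)
print(Y.mean(), Y.var())
```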

#### **Poisson Distribution as a Single-Parameter EDF**

For the Poisson distribution with parameters $\lambda > 0$ and $v > 0$ we choose the counting measure on $\mathbb{N}_0/v$ for exposure $\omega = v$. Then we make the following choices

$$a(y) = \log\left(\frac{v^{vy}}{(vy)!}\right), \quad \kappa(\theta) = e^{\theta}, \quad \lambda = \kappa'(\theta) = e^{\theta}, \quad \theta = h(\lambda) = \log(\lambda),$$

for effective domain $\Theta = \mathbb{R}$ and dual parameter space $\mathcal{M} = (0,\infty)$. With these choices we have

$$f(y;\theta,v) = \frac{v^{vy}}{(vy)!}\exp\left\{v\left(\theta y - e^{\theta}\right)\right\} = e^{-v\lambda}\frac{(v\lambda)^{vy}}{(vy)!}.\tag{2.15}$$

This is a single-parameter EDF. The canonical link *λ* → *h(λ)* is the log-link. Mean and variance are given by

$$
\lambda = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = e^{\theta} \qquad \text{and} \qquad \text{Var}_{\theta}(Y) = \frac{1}{v}\kappa''(\theta) = \frac{1}{v}e^{\theta} = \frac{1}{v}\lambda,
$$

and the variance function is given by $V(\lambda) = \lambda$, that is, the variance function is linear in the mean parameter $\lambda$. The Poisson random variable is obtained by setting $X = vY \sim \text{Poi}(v\lambda)$. We choose $\varphi = 1$ here, meaning that we have neither under- nor over-dispersion. Thus, the choices $v$ and $\varphi$ in $\omega = v/\varphi$ have the interpretation of an exposure and a dispersion parameter, respectively. This interpretation is going to be important in claim counts modeling, below.
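The exposure interpretation can be illustrated by a short simulation (hypothetical $\lambda$ and $v$): the reproductive variable $Y = X/v$ fluctuates around $\lambda$ with variance $\lambda/v$, so a larger exposure gives a more precise frequency observation.

```python
import numpy as np

rng = np.random.default_rng(5)
lam, v = 0.8, 250.0                  # hypothetical expected frequency and exposure
X = rng.poisson(lam * v, size=200_000)
Y = X / v                            # reproductive form Y = X/v

print(Y.mean())                      # ~ lambda
print(Y.var())                       # ~ lambda/v: larger exposure => smaller variance
```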

#### **Gamma Distribution as a Single-Parameter EDF**

For the gamma distribution with parameters $\alpha, \beta > 0$ we choose the Lebesgue measure on $\mathbb{R}_+$ and shape parameter $\omega = v/\varphi = \alpha$. We make the following choices

$$a(y) = (\alpha-1)\log y + \alpha\log\alpha - \log\Gamma(\alpha), \qquad \kappa(\theta) = -\log(-\theta),$$

$$\mu = \kappa'(\theta) = -1/\theta, \qquad \theta = h(\mu) = -1/\mu,$$

for effective domain $\Theta = (-\infty, 0)$ and dual parameter space $\mathcal{M} = (0,\infty)$. With these choices we have

$$f(y;\theta,\alpha) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}y^{\alpha-1}\exp\left\{\alpha\left(y\theta + \log(-\theta)\right)\right\} = \frac{(-\theta\alpha)^{\alpha}}{\Gamma(\alpha)}y^{\alpha-1}\exp\left\{-(-\theta\alpha)y\right\}.$$

This is analogous to (2.6) with shape parameter *α >* 0 and scale parameter *β* = −*θ >* 0. Mean and variance are given by

$$\mu = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = -\theta^{-1} \qquad \text{and} \qquad \text{Var}_{\theta}(Y) = \frac{1}{\alpha}\kappa''(\theta) = \frac{1}{\alpha}\theta^{-2},$$

and the variance function is given by $V(\mu) = \mu^2$, that is, the variance function is quadratic in the mean parameter $\mu$. The gamma random variable is obtained by setting $X = \alpha Y \sim \Gamma(\alpha, \beta)$. This gives us for the first two moments of $X$

$$
\mu_X = \mathbb{E}_{\theta}[X] = \frac{\alpha}{\beta} \qquad \text{and} \qquad \text{Var}_{\theta}(X) = \frac{\alpha}{\beta^2} = \frac{1}{\alpha}\mu_X^2.
$$

Suppose $v = 1$. For shape parameter $\alpha > 1$ we have under-dispersion $\varphi = 1/\alpha < 1$ and the gamma density is unimodal; for shape parameter $\alpha < 1$ we have over-dispersion $\varphi = 1/\alpha > 1$ and the gamma density is strictly decreasing; we refer to Fig. 2.1.
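This shape dichotomy can be verified by evaluating the gamma density (2.6) on a grid; the parameter choices below are hypothetical:

```python
import numpy as np
from math import gamma as gamma_fct

def gamma_density(y, alpha, beta):
    # gamma density with shape alpha and rate beta (v = 1, phi = 1/alpha)
    return beta**alpha / gamma_fct(alpha) * y**(alpha - 1) * np.exp(-beta * y)

y = np.linspace(1e-3, 10.0, 10_000)
f_under = gamma_density(y, 2.0, 2.0)      # alpha > 1: under-dispersion, unimodal
f_over = gamma_density(y, 0.5, 0.5)       # alpha < 1: over-dispersion, decreasing

print(y[np.argmax(f_under)])              # interior mode at (alpha-1)/beta = 0.5
print(bool(np.all(np.diff(f_over) < 0)))  # strictly decreasing density
```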

#### **Inverse Gaussian Distribution as a Single-Parameter EDF**

For the inverse Gaussian distribution with parameters $\alpha, \beta > 0$ we choose the Lebesgue measure on $\mathbb{R}_+$ and we set $\omega = v/\varphi = \alpha$. We make the following choices

$$a(y) = \log\left(\frac{\alpha^{1/2}}{(2\pi y^3)^{1/2}}\right) - \frac{\alpha}{2y}, \qquad \kappa(\theta) = -(-2\theta)^{1/2},$$

$$\mu = \kappa'(\theta) = \frac{1}{(-2\theta)^{1/2}}, \qquad \theta = h(\mu) = -\frac{1}{2\mu^2},$$

for $\theta \in \Theta = (-\infty, 0)$ and dual parameter space $\mathcal{M} = (0,\infty)$. With these choices we have

$$f(y;\theta,\alpha)\,dy = \frac{\alpha^{1/2}}{(2\pi y^3)^{1/2}}\exp\left\{\alpha\left(\theta y + (-2\theta)^{1/2}\right) - \frac{\alpha}{2y}\right\}dy$$

$$= \frac{\alpha^{1/2}}{(2\pi y^3)^{1/2}}\exp\left\{-\frac{\alpha}{2y}\left(1 - (-2\theta)^{1/2}y\right)^2\right\}dy$$

$$= \frac{\alpha}{(2\pi x^3)^{1/2}}\exp\left\{-\frac{\alpha^2}{2x}\left(1 - \frac{(-2\theta)^{1/2}}{\alpha}x\right)^2\right\}dx,$$

where in the last step we did a change of variable *y* → *x* = *αy*. This is exactly (2.8). Mean and variance are given by

$$\mu = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = (-2\theta)^{-1/2} \quad \text{and} \quad \text{Var}_{\theta}(Y) = \frac{1}{\alpha}\kappa''(\theta) = \frac{1}{\alpha}(-2\theta)^{-3/2},$$

and the variance function is given by $V(\mu) = \mu^3$, that is, the variance function is cubic in the mean parameter $\mu$. The inverse Gaussian random variable is obtained by setting $X = \alpha Y$. The mean and variance of $X$ are given by, setting $\beta = (-2\theta)^{1/2} > 0$,

$$
\mu_X = \mathbb{E}_{\theta}[X] = \frac{\alpha}{\beta} \qquad \text{and} \qquad \text{Var}_{\theta}(X) = \frac{\alpha}{\beta^3} = \frac{1}{\alpha^2}\mu_X^3.
$$

This inverse Gaussian density is illustrated in Fig. 2.2.

Similarly to (2.9), we can extend the inverse Gaussian model to the boundary case $\theta = 0$, i.e., the effective domain $\Theta = (-\infty, 0]$ is not open. This provides us with the density

$$f(y;\theta=0,\alpha)\,dy = \frac{\alpha}{(2\pi x^3)^{1/2}}\exp\left\{-\frac{\alpha^2}{2x}\right\}dx,\tag{2.16}$$

using, as above, the change of variable *y* → *x* = *αy*. An additional transformation *x* → 1*/x* gives a gamma distribution with shape parameter 1/2 and scale parameter *α*2*/*2.

*Remark 2.16* The inverse Gaussian case gives an example of a non-open effective domain $\Theta = (-\infty, 0]$. It is worth noting that for the boundary parameter $\theta = 0$, the first moment does not exist, i.e., Corollary 2.14 only makes statements in the interior $\mathring{\Theta}$ of the effective domain $\Theta$. This also relates to Remarks 2.9 on the dual parameter space $\mathcal{M}$.

## *2.2.3 Tweedie's Distributions*

Tweedie's compound Poisson (CP) model was introduced in 1984 by Tweedie [358], and it has been studied in detail in Jørgensen [202], Jørgensen–de Souza [204], Smyth–Jørgensen [342] and in the review paper of Delong et al. [94]. Tweedie's CP model belongs to the EDF. We spend more time on explaining Tweedie's CP model because it plays an important role in actuarial modeling.

Tweedie's CP model is obtained by choosing as $\sigma$-finite measure $\nu_1$ a mixture of the Lebesgue measure on $(0,\infty)$ and a point measure in 0. Furthermore, we choose *power variance parameter* $p \in (1,2)$ and cumulant function

$$\kappa(\theta) = \kappa_p(\theta) = \frac{1}{2-p}\left((1-p)\theta\right)^{\frac{2-p}{1-p}},\tag{2.17}$$

on the effective domain $\theta \in \Theta = (-\infty, 0)$. This provides us with Tweedie's CP model

$$Y \sim f(y;\theta, v/\varphi) = \exp\left\{\frac{y\theta - \kappa_p(\theta)}{\varphi/v} + a(y; v/\varphi)\right\},$$

with exposure $v > 0$ and dispersion parameter $\varphi > 0$; the normalizing function $a(\cdot\,; v/\varphi)$ does not have any simple closed form; we refer to Section 2.1 in Jørgensen–de Souza [204] and Section 4.2 in Jørgensen [203].

The first two moments of Tweedie's CP random variable *Y* are given by

$$\mu = \mathbb{E}_{\theta}[Y] = \kappa_p'(\theta) = ((1-p)\theta)^{\frac{1}{1-p}} \in \mathcal{M} = (0,\infty),\tag{2.18}$$

$$\text{Var}_{\theta}(Y) = \frac{\varphi}{v}\kappa_p''(\theta) = \frac{\varphi}{v}\left((1-p)\theta\right)^{\frac{p}{1-p}} = \frac{\varphi}{v}\mu^p > 0.\tag{2.19}$$

The parameter $p \in (1,2)$ determines the power variance function $V(\mu) = \mu^p$ between the Poisson $p = 1$ and the gamma $p = 2$ cases, see Sect. 2.2.2.

The moment generating function of Tweedie's CP random variable $X = vY/\varphi = \omega Y$ in its additive form is given by, using Corollary 2.14,

$$M_X(r) = M_{vY/\varphi}(r) = \exp\left\{\frac{v}{\varphi}\kappa_p(\theta)\left(\left(\frac{-\theta}{-\theta-r}\right)^{\frac{2-p}{p-1}} - 1\right)\right\} \qquad \text{for } r < -\theta.$$

Some readers will notice that this is the moment generating function of a CP distribution having i.i.d. gamma claim sizes. This is exactly the statement of the next proposition which is found, e.g., in Smyth–Jørgensen [342].

**Proposition 2.17** *Assume $S = \sum_{i=1}^N Z_i$ is CP distributed with Poisson claim counts $N \sim \text{Poi}(\lambda v)$ and i.i.d. gamma claim sizes $Z_i \sim \Gamma(\alpha, \beta)$ being independent of $N$. We have $S \stackrel{(d)}{=} vY/\varphi$ by identifying the parameters as follows*

$$p = \frac{\alpha+2}{\alpha+1} \in (1,2), \qquad \beta = -\theta > 0 \qquad \text{and} \qquad \lambda = \frac{1}{\varphi}\kappa_p(\theta) > 0.$$

*Proof of Proposition 2.17* Assume *S* is CP distributed with i.i.d. gamma claim sizes. From Proposition 2.11 and Section 3.2.1 in Wüthrich [387] we receive that the moment generating function of *S* is given by

$$M_S(r) = \exp\left\{\lambda v\left(\left(\frac{\beta}{\beta-r}\right)^{\alpha} - 1\right)\right\} \qquad \text{for } r < \beta.$$

Using the proposed parameter identification, the claim immediately follows.

Proposition 2.17 gives us a second interpretation of Tweedie's CP model which was introduced in an EDF fashion, above. This second interpretation explains the name of this EDF model: it explains the mixture of the Lebesgue measure and the point measure in 0, and it also highlights why the Poisson model and the gamma model are the boundary cases in terms of power variance functions.
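Proposition 2.17 can be checked by simulation: the sketch below simulates the CP sum $S$ directly (using that a sum of $n$ i.i.d. $\Gamma(\alpha,\beta)$ claims is $\Gamma(n\alpha,\beta)$) and compares its first two moments with the Tweedie mean $(v/\varphi)\mu$ and variance $(v/\varphi)\mu^p$ implied by the parameter identification; all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(11)
alpha, beta, lam, v = 3.0, 2.0, 1.5, 100.0   # hypothetical severity and frequency parameters

# parameter identification of Proposition 2.17
p = (alpha + 2) / (alpha + 1)                # power variance parameter in (1,2)
theta = -beta                                # canonical parameter
kappa_p = ((1 - p) * theta)**((2 - p) / (1 - p)) / (2 - p)
phi = kappa_p / lam                          # dispersion parameter
mu = ((1 - p) * theta)**(1 / (1 - p))        # Tweedie mean (2.18)

# simulate the compound Poisson sum S = Z_1 + ... + Z_N directly;
# the sum of n i.i.d. Gamma(alpha, beta) claims is Gamma(n*alpha, beta)
n_sim = 100_000
N = rng.poisson(lam * v, size=n_sim)
S = rng.gamma(shape=np.maximum(N, 1) * alpha, scale=1 / beta) * (N > 0)

print(S.mean(), v / phi * mu)                # CP mean vs Tweedie mean of vY/phi
print(S.var(), v / phi * mu**p)              # CP variance vs Tweedie variance (2.19)
```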

An interesting question is whether the EDF can be extended beyond power variance functions $V(\mu) = \mu^p$ with $p \in [1,2]$. The answer to this question is yes, and the full answer is provided in Theorem 2 of Jørgensen [202]:

**Theorem 2.18 (Jørgensen [202], Without Proof)** *Only power variance parameters p* ∈ *(*0*,* 1*) do not allow for EDF models.*

Table 2.1 gives the EDF distributions that have a power variance function. These distributions are called *Tweedie's distributions*, with the special case of Tweedie's CP distributions for $p \in (1,2)$. The densities for $p \in \{0, 1, 2, 3\}$ have a closed form, but the other Tweedie's distributions do not have a closed-form density. Thus, they cannot explicitly be constructed as suggested in Sect. 2.2.1. Besides the constructive approach presented above, there is a uniqueness theorem saying that the variance function $V(\cdot)$ on the domain $\mathcal{M}$ characterizes the single-parameter linear EF, see Theorem 2.11 in Jørgensen [203]. This uniqueness theorem is the basis of the proof of Theorem 2.18. Tweedie's distributions for $p \notin \{0, 1, 2, 3\}$ involve infinite sums for the normalization $\exp\{a(\cdot\,,\cdot)\}$, we refer to formulas (4.19), (4.20) and (4.31) in Jørgensen [203]; this is the reason that one has to go via the uniqueness theorem to prove Theorem 2.18. Dunn–Smyth [112] provide methods for fast calculation of some of these infinite sums; in Sect. 5.5.2, below, we present an approximation (saddlepoint approximation). The uniqueness theorem is also useful to construct new examples within the EF, see, e.g., Section 2 of Awad et al. [15].


**Table 2.1** Power variance function models $V(\mu) = \mu^p$ within the EDF (taken from Table 4.1 in Jørgensen [203])

## *2.2.4 Steepness of the Cumulant Function*

Assume we have a fixed EF satisfying Assumption 2.6. All random variables $T(Y)$ belonging to this EF have the same support, not depending on the particular choice of the canonical parameter $\boldsymbol{\theta} \in \boldsymbol{\Theta}$. We denote this support of $T(Y)$ by $\mathfrak{T}$.

Below, we are going to estimate the canonical parameter $\boldsymbol{\theta} \in \boldsymbol{\Theta}$ from data using maximum likelihood estimation. For this it is advantageous to have the property $\mathfrak{T} \subset \mathcal{M}$, because, intuitively, this allows us to directly select $\widehat{\mu} = T(Y)$ as the parameter estimate in the dual parameter space $\mathcal{M}$, for a given observation $T(Y) \in \mathfrak{T}$. This then translates to a canonical parameter $\widehat{\boldsymbol{\theta}} = h(\widehat{\mu}) = h(T(Y)) \in \boldsymbol{\Theta}$, using the canonical link $h$; this estimation approach will be better motivated in Chap. 3, below. Unfortunately, many examples of the EF do not satisfy this property $\mathfrak{T} \subset \mathcal{M}$. For instance, in the Poisson model the observation $T(Y) = Y = 0$ is not included in $\mathcal{M}$, see Table 2.1. This poses some challenges in parameter estimation, and the purpose of this small discussion is to be prepared for these challenges.

A cumulant function $\kappa$ is called *steep* if for all $\boldsymbol{\theta} \in \mathring{\boldsymbol{\Theta}}$ and all $\widetilde{\boldsymbol{\theta}}$ in the boundary of $\boldsymbol{\Theta}$

$$\left(\widetilde{\boldsymbol{\theta}} - \boldsymbol{\theta}\right)^{\top}\nabla_{\boldsymbol{\theta}}\kappa\left(\alpha\boldsymbol{\theta} + (1-\alpha)\widetilde{\boldsymbol{\theta}}\right) \;\to\; \infty \qquad \text{for } \alpha \downarrow 0,\tag{2.20}$$

we refer to Formula (20) in Section 8.1 of Barndorff-Nielsen [23]. Define the convex closure of the support $\mathfrak{T}$ by $\mathfrak{C} = \overline{\operatorname{conv}(\mathfrak{T})}$.

**Theorem 2.19 (Theorem 9.2 in Barndorff-Nielsen [23], Without Proof)** *Assume we have a fixed EF satisfying Assumption 2.6. The cumulant function $\kappa$ is steep if and only if $\mathring{\mathfrak{C}} = \mathcal{M} = \nabla_{\boldsymbol{\theta}}\kappa(\mathring{\boldsymbol{\Theta}})$.*

Theorem 2.19 tells us that for a steep cumulant function we have $\mathring{\mathfrak{C}} = \mathcal{M} = \nabla_{\boldsymbol{\theta}}\kappa(\mathring{\boldsymbol{\Theta}})$. In this case parameter estimation can be extended to observations $T(Y) \in \overline{\mathcal{M}}$ such that we may obtain a degenerate model at the boundary of $\mathcal{M}$. Coming back to our Poisson example from above, in this case we set $\widehat{\mu} = 0$, which gives a degenerate Poisson model.

Throughout this book we will work under the assumption that $\kappa$ is steep. The classical examples satisfy this assumption: the examples with power variance parameter $p$ in $\{0\} \cup [1,\infty)$ satisfy Theorem 2.19; this includes the Gaussian, the Poisson, the gamma, the inverse Gaussian and Tweedie's CP models, see Table 2.1. Moreover, the examples we have met in Sect. 2.1 fulfill this assumption; these are the single-parameter linear EF models of the Bernoulli, the binomial and the negative binomial distributions, as well as the vector-valued parameter examples of the Gaussian, the gamma and the inverse Gaussian models and of the categorical distribution. The only models we have seen that do not have a steep cumulant function are the power variance models with $p < 0$, see Table 2.1.

*Remark 2.20* Working within the EDF needs some additional thoughts because the support $\mathfrak{T} = \mathfrak{T}_\omega$ of the single-parameter linear EDF random variable $Y = T(Y)$ may depend on the specific choice of the dispersion parameter $\omega \in \mathcal{W} \supset \{1\}$ through the $\sigma$-finite measure $d\nu_\omega(\omega\,\cdot)$, see (2.13). For instance, in the binomial case the support of $Y$ is given by $\mathfrak{T}_\omega = \{0, 1/n, \ldots, 1\}$ with $\omega = n$, see Sect. 2.2.2.

Assume that the cumulant function $\kappa$ is steep for the single-parameter linear EF that corresponds to the single-parameter EDF with $\omega = 1$. Theorem 2.19 then implies that for this choice we have $\mathring{\mathfrak{C}}_{\omega=1} = \nabla_\theta\kappa(\mathring{\Theta})$ with convex closure $\mathfrak{C}_{\omega=1} = \overline{\operatorname{conv}(\mathfrak{T}_{\omega=1})}$.

Consider $\omega \in \mathcal{W}\setminus\{1\}$ which corresponds to the choice $\nu_\omega$ of the $\sigma$-finite measure on $\mathbb{R}$. This choice belongs to the cumulant function $\theta \mapsto \omega\kappa(\theta)$ in the additive form ($x$-parametrization in (2.13)). Since steepness (2.20) holds for any $\omega > 0$, we receive that the convex closure of the support of this distribution in the $x$-parametrization in (2.13) is given by $\nabla_\theta\,\omega\kappa(\mathring{\Theta}) = \omega\nabla_\theta\kappa(\mathring{\Theta})$. The duality transformation $x \mapsto y = x/\omega$ leads to the change of measure $d\nu_\omega(x) \to d\nu_\omega(\omega y)$ and to the corresponding change of support, see (2.13). The latter implies that in the reproductive form ($y$-parametrization) the convex closure of the support does not depend on the specific choice of $\omega \in \mathcal{W}$. Since the EDF representation given in (2.14) corresponds to the $y$-parametrization (reproductive form), we can use Theorem 2.19 without limitation also for the single-parameter linear EDF given by (2.14), and $\mathfrak{C}$ does not depend on $\omega \in \mathcal{W}$.

## *2.2.5 Lab: Large Claims Modeling*

From Corollary 2.14 we know that the moment generating function exists around the origin for all examples belonging to the EDF. This implies that the moments of all orders exist, and that we have an exponentially decaying survival function $\mathbb{P}_{\theta}[Y > y] = 1 - F(y;\theta,\omega) \sim \exp\{-cy\}$ for some constant $c > 0$ as $y \to \infty$, see (1.2). In many applied situations the data is more heavy-tailed and, thus, cannot be modeled by such an exponentially decaying survival function. In such cases one often chooses a distribution function with a regularly varying survival function; regular variation with tail index $\beta > 0$ has been introduced in (1.3). A popular choice is a log-gamma distribution which can be obtained from the gamma distribution (belonging to the EDF). We briefly explain how this is done and how it relates to the Pareto and the Lomax [256] distributions.

We start from the gamma density (2.6). The random variable *Z* has a log-gamma distribution with shape parameter *α >* 0 and scale parameter *β* = −*θ >* 0 if log*(Z)* = *Y* has a gamma distribution with these parameters. Thus, the gamma density of *Y* = log*(Z)* is given by

$$f(y; \beta, \alpha)\,dy = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha - 1} \exp\{-\beta y\}\, dy \qquad \text{for } y > 0.$$

We do a change of variable *y* → *z* = exp{*y*}to receive the density of the log-gamma distributed random variable *Z* = exp{*Y* }

$$f(z; \beta, \alpha)dz = \frac{\beta^{\alpha}}{\Gamma(\alpha)}(\log z)^{\alpha - 1}z^{-(\beta + 1)}dz \qquad \text{ for } z > 1.$$

This log-gamma density has support *(1, ∞)*. The distribution function of this log-gamma distributed random variable needs to be calculated numerically, and its survival function is regularly varying with tail index *β* > 0.
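Since the log-gamma distribution function is only available numerically, it can be approximated by quadrature. The following sketch (the parameter values *α* = 2, *β* = 3 are illustrative assumptions) integrates the log-gamma density with the trapezoidal rule and, for the integer shape *α* = 2, cross-checks the result against the closed-form gamma distribution function of *Y* = log(*Z*):

```python
import math

def log_gamma_density(z, beta, alpha):
    # f(z; beta, alpha) = beta^alpha / Gamma(alpha) * (log z)^(alpha-1) * z^(-(beta+1)), z > 1
    return beta**alpha / math.gamma(alpha) * math.log(z)**(alpha - 1) * z**(-(beta + 1))

def log_gamma_cdf(z, beta, alpha, steps=200_000):
    # Trapezoidal rule on [1, z]; no closed form is available in general.
    h = (z - 1.0) / steps
    total = 0.5 * (log_gamma_density(1.0, beta, alpha) + log_gamma_density(z, beta, alpha))
    total += sum(log_gamma_density(1.0 + k * h, beta, alpha) for k in range(1, steps))
    return h * total

alpha, beta = 2.0, 3.0
cdf_numeric = log_gamma_cdf(50.0, beta, alpha)

# Cross-check: for integer shape alpha = 2 we have Y = log(Z) ~ Gamma(2, rate beta),
# whose distribution function is 1 - exp(-beta*y) * (1 + beta*y).
y = math.log(50.0)
cdf_exact = 1.0 - math.exp(-beta * y) * (1.0 + beta * y)
```

The quadrature agrees with the closed-form check up to the discretization error of the grid.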

A special case of the log-gamma distribution is the Pareto distribution. The Pareto distribution is more tractable and it is obtained by setting shape parameter *α* = 1 in the log-gamma density. This gives us the Pareto density

$$f(z; \beta)dz = f(z; \beta, \alpha = 1)dz = \beta z^{-(\beta + 1)}dz \qquad \text{ for } z > 1.$$

The distribution function in this Pareto case is for *z* ≥ 1 given by

$$F(z; \beta) = 1 - z^{-\beta}.$$

Obviously, this provides a regularly varying survival function with tail index *β* > 0; in fact, in this case we do not need to pass to the limit in (1.3) because we have an exact identity. The Pareto distribution has the nice property that it is closed under thresholding (lower-truncation) at *M*, that is, we remain within the family of Pareto distributions with the same tail index *β* by considering lower-truncated claims: for 1 ≤ *M* ≤ *z* we have

$$F(z; \beta, M) = \mathbb{P}\left[Z \le z \mid Z > M\right] = \frac{\mathbb{P}\left[M < Z \le z\right]}{\mathbb{P}\left[Z > M\right]} = 1 - \left(\frac{z}{M}\right)^{-\beta}.$$

This is the classical definition of the Pareto distribution, and it allows us to preserve full flexibility in the choice of the threshold *M* > 0.
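The closure under lower-truncation can be verified numerically; a minimal sketch (the parameter values are illustrative):

```python
def pareto_survival(z, beta, M):
    # Survival function of the Pareto distribution: P[Z > z] = (z/M)^(-beta) for z >= M.
    return (z / M) ** (-beta)

beta, M = 2.0, 1.0
M2 = 10.0   # a larger threshold M2 >= M

# Lower-truncation at M2: P[Z > z | Z > M2] computed from the original model ...
z = 25.0
conditional = pareto_survival(z, beta, M) / pareto_survival(M2, beta, M)

# ... coincides with the Pareto survival function with the new threshold M2.
truncated = pareto_survival(z, beta, M2)
```

The conditional survival probability is again of Pareto type, with the same tail index *β* and the new threshold *M₂*.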

The disadvantage of the Pareto distribution is that it does not provide a continuous density on ℝ₊ as there is a discontinuity at the threshold *M*. For this reason, one sometimes considers another change of variable *Z* → *X* = *Z* − *M* for a Pareto distributed random variable *Z* ∼ *F(·; β, M)*. This provides the Lomax distribution, also called Pareto Type II distribution. *X* has the following distribution function on *(0, ∞)*

$$\mathbb{P}\left[X \le x\right] = 1 - \left(\frac{x+M}{M}\right)^{-\beta} \qquad \text{for } x \ge 0.$$

This distribution has again a regularly varying survival function with tail index *β >* 0. Moreover, we have

$$\lim\_{x \to \infty} \frac{\left(\frac{x + M}{M}\right)^{-\beta}}{\left(\frac{x}{M}\right)^{-\beta}} = \lim\_{x \to \infty} \left(1 + \frac{M}{x}\right)^{-\beta} = 1.$$

This says that we should choose the same threshold *M >* 0 for both the Pareto and the Lomax distribution to receive the same asymptotic tail behavior, and this also quantifies the rate of convergence between the two survival functions. Figure 2.3 illustrates this convergence in a log-log plot choosing tail index *β* = 2 and threshold *M* = 1 000 000.

For completeness we provide the density of the Pareto distribution

$$f(z; \beta, M) = \frac{\beta}{M} \left(\frac{z}{M}\right)^{-(\beta+1)} \qquad \text{for } z \ge M,$$

and of the Lomax distribution

$$f(x; \beta, M) = \frac{\beta}{M} \left(\frac{x + M}{M}\right)^{-(\beta + 1)} \qquad \text{for } x \ge 0.$$
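The convergence of the Lomax survival function to the Pareto one, at rate *(1 + M/x)^{−β}*, can also be traced numerically; a small sketch using the parameters of Fig. 2.3:

```python
beta, M = 2.0, 1_000_000.0   # tail index and threshold as in Fig. 2.3

def pareto_sf(z):
    # Pareto survival function (z/M)^(-beta), z >= M
    return (z / M) ** (-beta)

def lomax_sf(x):
    # Lomax survival function ((x+M)/M)^(-beta), x >= 0
    return ((x + M) / M) ** (-beta)

# The ratio equals (1 + M/x)^(-beta) and increases to 1 as x grows.
ratios = [lomax_sf(x) / pareto_sf(x) for x in (1e7, 1e8, 1e9)]
```

The ratios approach 1 from below, which mirrors the log-log plot of Fig. 2.3.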

## **2.3 Information Geometry in Exponential Families**

We make a short excursion into information geometry. This excursion may look a bit disconnected from what we have done so far, but it provides important background for the chapter on forecast evaluation, see Chap. 4, below.

## *2.3.1 Kullback–Leibler Divergence*

There is literature in information geometry which uses techniques from differential geometry to study EFs as Riemannian manifolds, with points corresponding to EF densities parametrized by their canonical parameters *θ* ∈ *Θ*; we refer to Amari [10], Ay et al. [16] and Nielsen [285] for an extended treatment of these mathematical concepts.

Choose a fixed EF (2.2) with cumulant function *κ* on the effective domain *Θ* ⊆ ℝ*ᵏ* and with *σ*-finite measure *ν* on ℝ. We define the Kullback–Leibler (KL) divergence (relative entropy) from model *θ₁* ∈ *Θ* to model *θ₀* ∈ *Θ* within this EF by

$$D\_{\mathrm{KL}}(f(\cdot;\theta\_0)||f(\cdot;\theta\_1)) = \int\_{\mathbb{R}} f(\mathbf{y};\theta\_0) \log \left(\frac{f(\mathbf{y};\theta\_0)}{f(\mathbf{y};\theta\_1)}\right) d\boldsymbol{\nu}(\mathbf{y}) \ge 0.$$

Recall that the support of the EF does not depend on the specific choice of the canonical parameter *θ* in *Θ*, see Remarks 2.3; this implies that the KL divergence is well-defined here. The positivity of the KL divergence is obtained from Jensen's inequality; this is proved in Lemma 2.21, below.

The KL divergence has the interpretation of having a data model that is characterized by the distribution *f (*·; *θ*0*)*, and we would like to measure how close another model *f (*·; *θ*1*)* is to the data model. Note that the KL divergence is not a distance function because it is neither symmetric nor does it satisfy the triangle inequality.

We calculate the KL divergence within the chosen EF

$$D\_{\rm KL}(f(\cdot;\theta\_0)||f(\cdot;\theta\_1)) = \int\_{\mathbb{R}} f(\mathbf{y};\theta\_0) \left[ (\theta\_0 - \theta\_1)^\top T(\mathbf{y}) - \kappa(\theta\_0) + \kappa(\theta\_1) \right] d\nu(\mathbf{y})$$

$$= (\theta\_0 - \theta\_1)^\top \nabla\_{\theta} \kappa(\theta\_0) - \kappa(\theta\_0) + \kappa(\theta\_1) \ge 0,\qquad(2.21)$$

where we have used Corollary 2.5, and the positivity of the KL divergence can be seen from the convexity of *κ*. This allows us to consider the following (Taylor) expansion

$$\kappa(\boldsymbol{\theta}\_{1}) = \kappa(\boldsymbol{\theta}\_{0}) + \nabla\_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta}\_{0})^{\top}(\boldsymbol{\theta}\_{1} - \boldsymbol{\theta}\_{0}) + D\_{\text{KL}}(f(\cdot;\boldsymbol{\theta}\_{0})||f(\cdot;\boldsymbol{\theta}\_{1})).\tag{2.22}$$

This illustrates that the KL divergence corresponds to second and higher order differences between the cumulant value *κ(θ₀)* and another cumulant value *κ(θ₁)*. The gradients of the KL divergence w.r.t. *θ₁* in *θ₁* = *θ₀* and w.r.t. *θ₀* in *θ₀* = *θ₁* are given by

$$\left. \nabla\_{\theta\_1} D\_{\text{KL}}(f(\cdot; \theta\_0) || f(\cdot; \theta\_1)) \right|\_{\theta\_1 = \theta\_0} \tag{2.23}$$

$$= \left. \nabla\_{\theta\_0} D\_{\text{KL}}(f(\cdot; \theta\_0) || f(\cdot; \theta\_1)) \right|\_{\theta\_0 = \theta\_1} = \left. \mathbf{0}.$$

This emphasizes that the KL divergence reflects second and higher-order terms in the cumulant function *κ*, and that the data model *θ₀* forms the minimum of this KL divergence (as a function of *θ₁*), as the following calculation shows. We calculate the Hessian (second order term) w.r.t. *θ₁* in *θ₁* = *θ₀*

$$\nabla^2\_{\theta\_1} D\_{\text{KL}}(f(\cdot;\theta\_0)||f(\cdot;\theta\_1))\Big|\_{\theta\_1=\theta\_0} = \nabla^2\_{\theta} \kappa(\theta)\Big|\_{\theta=\theta\_0} \stackrel{\text{def.}}{=} \mathcal{I}(\theta\_0).$$

The positive definite matrix *I(θ₀)* (in a minimal representation) is called *Fisher's information*. Fisher's information is an important tool in statistics that we will meet in Theorem 3.13 of Sect. 3.3, below. A function satisfying (2.21) (being zero if and only if *θ₀* = *θ₁*), fulfilling (2.23) and having positive definite Fisher's information is called a *divergence*, see Definition 5 in Nielsen [285]. Fisher's information *I(θ₀)* measures the curvature of the KL divergence in *θ₀* and we have the second order Taylor approximation

$$\kappa(\boldsymbol{\theta}\_{1}) \approx \kappa(\boldsymbol{\theta}\_{0}) + \nabla\_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}\_{0})^{\top} \left(\boldsymbol{\theta}\_{1} - \boldsymbol{\theta}\_{0}\right) + \frac{1}{2} \left(\boldsymbol{\theta}\_{1} - \boldsymbol{\theta}\_{0}\right)^{\top} \mathcal{I}(\boldsymbol{\theta}\_{0}) \left(\boldsymbol{\theta}\_{1} - \boldsymbol{\theta}\_{0}\right) \dots$$

Next-order terms are obtained from the so-called Amari–Chentsov tensor, see Amari [10] and Section 4.2 in Ay et al. [16]. In information geometry one studies the (possibly degenerate) Riemannian metric on the effective domain induced by Fisher's information; we refer to Section 3.7 in Nielsen [285].

**Lemma 2.21** *Consider two densities p and q w.r.t. a given σ-finite measure ν. We have DKL(p*||*q)* ≥ 0*, and DKL(p*||*q)* = 0 *if and only if p* = *q, ν-a.s.*

*Proof* Assume *Y* ∼ *pdν*, then we can rewrite the KL divergence, using Jensen's inequality,

$$D\_{\rm KL}(p||q) = \int p(\mathbf{y}) \log \left(\frac{p(\mathbf{y})}{q(\mathbf{y})}\right) d\boldsymbol{\nu}(\mathbf{y}) = -\mathbb{E}\_p\left[\log \left(\frac{q(\mathbf{y})}{p(\mathbf{y})}\right)\right]$$

$$\geq -\log \mathbb{E}\_p\left[\frac{q(\mathbf{y})}{p(\mathbf{y})}\right] = -\log \int q(\mathbf{y}) d\boldsymbol{\nu}(\mathbf{y}) \geq 0. \tag{2.24}$$

Equality holds if and only if *p* = *q*, *ν*-a.s. The last inequality of (2.24) accounts for the fact that *q* does not necessarily need to be a density w.r.t. *ν*, i.e., we may also have ∫ *q(y)dν(y)* < 1.
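On a finite state space the KL divergence is a plain sum, and the statements of Lemma 2.21 can be illustrated directly; a minimal sketch (the chosen mass functions are illustrative):

```python
import math

# Two strictly positive probability mass functions on {0, ..., 9}.
w_p = [k + 1 for k in range(10)]
p = [w / sum(w_p) for w in w_p]   # linearly increasing weights
q = [1.0 / 10.0] * 10             # uniform reference density

def kl(p, q):
    # D_KL(p || q) = sum_k p_k * log(p_k / q_k)
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))

d_pq = kl(p, q)   # strictly positive since p != q
d_pp = kl(p, p)   # zero since the two densities coincide
```

The computation also shows the lack of symmetry: `kl(p, q)` and `kl(q, p)` differ, so the KL divergence is not a distance function.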

## *2.3.2 Unit Deviance and Bregman Divergence*

In the next chapter we are going to introduce maximum likelihood estimation for parameters, see Definition 3.4, below. Maximum likelihood estimators are obtained by maximizing likelihood functions (evaluated in the observations). Maximizing likelihood functions within the EDF is equivalent to minimizing deviance loss functions. Deviance loss functions are based on unit deviances, which, in turn, correspond to KL divergences. The purpose of this small section is to discuss this relation. This should be viewed as a preparation for Chap. 4.

Assume we work within a single-parameter linear EDF, i.e., *T(y)* = *y*. Using the canonical link *h* we obtain the canonical parameter *θ* = *h(μ)* ∈ *Θ* ⊆ ℝ from the mean parameter *μ* ∈ *M*. If we replace the (typically unknown) mean parameter *μ* by an observation *Y*, provided *Y* ∈ *M*, we get the specific model that is exactly calibrated to this observation. This provides us with the canonical parameter estimate *θ̂Y* = *h(Y)* for *θ*. We can now measure the KL divergence from any model represented by *θ* to the observation-calibrated model *θ̂Y* = *h(Y)*. This KL divergence is given by (we use (2.21) and we set *ω* = *v/ϕ* = 1)

$$\begin{aligned} D\_{\mathrm{KL}}\left(f\left(\cdot;h(Y),1\right)|\,|f\left(\cdot;\theta,1\right)\right) &= \int\_{\mathbb{R}} f\left(\mathbf{y};\widehat{\theta}\_{Y},1\right) \log\left(\frac{f\left(\mathbf{y};\widehat{\theta}\_{Y},1\right)}{f\left(\mathbf{y};\theta,1\right)}\right) d\nu(\mathbf{y}) \\ &= \left(h(Y)-\theta\right)Y - \kappa\left(h(Y)\right) + \kappa(\theta) \ge 0. \end{aligned}$$

This latter object is the unit deviance (up to factor 2) of the chosen EDF. It plays a crucial role in predictive modeling.

We define the *unit deviance* under the assumption that *κ* is steep as follows:

$$\mathfrak{d}: \mathring{\mathfrak{C}} \times \mathcal{M} \to \mathbb{R}\_+, \qquad (y, \mu) \mapsto \mathfrak{d}(y, \mu) = 2\Big( y h(y) - \kappa\big(h(y)\big) - y h(\mu) + \kappa\big(h(\mu)\big) \Big) \ge 0, \tag{2.25}$$

where *C* is the convex closure of the support *T* of *Y* and *M* is the dual parameter space of the chosen EDF. Steepness of *κ* implies *C̊* = *M*, see Theorem 2.19.

This unit deviance d is obtained from the KL divergence, and it is (twice) the difference of two log-likelihood functions, one using the canonical parameter *h(y)* and the other one having any canonical parameter *θ* = *h(μ)* ∈ *Θ̊*. That is, for *μ* = *κ'(θ)*,

$$\mathfrak{d}(\mathbf{y},\mu) = 2 \, D\_{\text{KL}}(f(\cdot; h(\mathbf{y}), 1) || f(\cdot; \theta, 1)) \tag{2.26}$$

$$= 2 \, \frac{\varphi}{v} \left( \log f(y; h(y), v/\varphi) - \log f(y; \theta, v/\varphi) \right),$$

for general *ω* = *v/ϕ* ∈ *W*. The latter can be rewritten as

$$f(\mathbf{y}; \theta, \mathbf{v}/\varphi) = f(\mathbf{y}; h(\mathbf{y}), \mathbf{v}/\varphi) \cdot \exp\left\{-\frac{1}{2\varphi/v} \mathfrak{d}(\mathbf{y}, \kappa'(\theta))\right\}.\tag{2.27}$$

This looks like a generalization of the Gaussian distribution, where the square difference *(y − μ)²* in the exponent is replaced by the unit deviance d*(y, μ)* with *μ* = *κ'(θ)*. This interpretation gets further support by the following lemma.

**Lemma 2.22** *Under Assumption 2.6 and the assumption that the cumulant function κ is steep, the unit deviance* d*(y, μ)* ≥ 0 *of the chosen EDF is zero if and only if y* = *μ. Moreover, the unit deviance* d*(y, μ) is twice continuously differentiable w.r.t. (y, μ) in C̊* × *M, and*

$$\left. \frac{\partial^2 \mathfrak{d}(\mathbf{y}, \mu)}{\partial \mu^2} \right|\_{\mathbf{y} = \mu} = \left. \frac{\partial^2 \mathfrak{d}(\mathbf{y}, \mu)}{\partial \mathbf{y}^2} \right|\_{\mathbf{y} = \mu} = -\left. \frac{\partial^2 \mathfrak{d}(\mathbf{y}, \mu)}{\partial \mu \partial \mathbf{y}} \right|\_{\mathbf{y} = \mu} = 2/V(\mu) > 0.$$

*Proof* The positivity and the if and only if statement follow from Lemma 2.21 and the strict convexity of *κ*. Continuous differentiability follows from the smoothness of *κ* in the interior of *Θ*. Moreover, we have

$$\left. \frac{\partial^2 \mathfrak{d}\left(\mathbf{y}, \mu\right)}{\partial \mu^2} \right|\_{\mathbf{y} = \mu} = \left. \frac{\partial}{\partial \mu} 2 \left( -\mathbf{y} h'(\mu) + \mu h'(\mu) \right) \right|\_{\mathbf{y} = \mu} = 2h'(\mu) = 2/\kappa''(h(\mu)) = 2/V(\mu) > 0,$$

where *V(μ)* is the variance function of the chosen EDF introduced in Corollary 2.14. The remaining second derivatives are obtained by similar (straightforward) calculations.

#### *Remarks 2.23*


More generally, the KL divergence and the unit deviance can be embedded into the framework of Bregman loss functions [50]. We restrict ourselves to the single-parameter EDF case. Assume that *ψ* : *C̊* → ℝ is a strictly convex function. The *Bregman divergence* w.r.t. *ψ* between *y* and *μ* is defined by

$$D\_{\psi}(y, \mu) = \psi(y) - \psi(\mu) - \psi'(\mu) \, (y - \mu) \ \ge \ 0, \tag{2.28}$$

where *ψ'* is a (sub-)gradient of *ψ*. The lower bound holds because of the convexity of *ψ*. Consider the specific choice *ψ(μ)* = *μh(μ)* − *κ(h(μ))* for the chosen EDF. Similarly to Lemma 2.22 we have *ψ''(μ)* = *h'(μ)* = 1*/V(μ)* > 0, which says that this choice is strictly convex. Using this choice for *ψ* gives us the unit deviance (up to the factor 1*/*2)

$$D\_{\psi}(y, \mu) = y h(y) - \kappa(h(y)) + \kappa(h(\mu)) - h(\mu) y = \frac{1}{2}\,\mathfrak{d}(y, \mu).\tag{2.29}$$

Thus, the unit deviance d can be understood as a difference of log-likelihoods (2.26), as a KL divergence *D*KL and as a Bregman divergence *Dψ* .
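Identity (2.29) can be verified numerically. The sketch below uses the Poisson member of the EDF (canonical link *h* = log and cumulant function *κ* = exp), for which *ψ(μ)* = *μ* log(*μ*) − *μ*; the evaluation points are illustrative:

```python
import math

def psi(mu):
    # psi(mu) = mu*h(mu) - kappa(h(mu)) with h = log and kappa = exp (Poisson case)
    return mu * math.log(mu) - mu

def bregman(y, mu):
    # D_psi(y, mu) = psi(y) - psi(mu) - psi'(mu)*(y - mu), where psi'(mu) = log(mu)
    return psi(y) - psi(mu) - math.log(mu) * (y - mu)

def poisson_unit_deviance(y, mu):
    # Poisson unit deviance d(y, mu) = 2*(mu - y - y*log(mu/y))
    return 2.0 * (mu - y - y * math.log(mu / y))

y, mu = 3.0, 5.0
half_deviance = 0.5 * poisson_unit_deviance(y, mu)
```

Up to floating point error, `bregman(y, mu)` equals half the unit deviance, it is strictly positive for *y* ≠ *μ*, and it vanishes at *y* = *μ*.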

*Example 2.24 (Poisson Model)* We start with a single-parameter EF example. Consider the cumulant function *κ(θ)* = exp{*θ*} for canonical parameter *θ* ∈ *Θ* = ℝ; this gives us the Poisson model. For the KL divergence from model *θ₁* to model *θ₀* we receive

$$D\_{\mathrm{KL}}(f(\cdot;\theta\_0)||f(\cdot;\theta\_1)) = \exp\{\theta\_1\} - \exp\{\theta\_0\} - (\theta\_1 - \theta\_0)\exp\{\theta\_0\} \ge 0,$$

which is zero if and only if *θ*<sup>0</sup> = *θ*1. Fisher's information is given by

$$\mathcal{I}(\theta) = \kappa''(\theta) = \exp\{\theta\} > 0.$$

If we have an observation *Y* > 0 we receive a model described by the canonical parameter *θ̂Y* = *h(Y)* = log*(Y)*. This gives us the unit deviance, see (2.26),

$$\mathfrak{d}(Y,\mu) = 2D\_{\text{KL}}(f(\cdot; h(Y), 1) || f(\cdot; \theta, 1))$$

$$= 2\left(e^{\theta} - Y - (\theta - \log(Y))Y\right)$$

$$= 2\left(\mu - Y - Y\log\left(\frac{\mu}{Y}\right)\right) \ge 0,$$

with *μ* = *κ'(θ)* = exp{*θ*}. This Poisson unit deviance will commonly be used for model fitting and forecast evaluation, see, e.g., (5.28).
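As a numerical sanity check, the closed-form Poisson KL divergence can be compared with its defining sum, truncated where the probability mass is negligible (the canonical parameters below are illustrative):

```python
import math

def poisson_pmf(k, lam):
    # Poisson probability mass function exp(-lam) * lam^k / k!
    return math.exp(-lam) * lam**k / math.factorial(k)

theta0, theta1 = 0.5, 1.2
lam0, lam1 = math.exp(theta0), math.exp(theta1)

# Closed form of the KL divergence for kappa(theta) = exp(theta):
kl_closed = math.exp(theta1) - math.exp(theta0) - (theta1 - theta0) * math.exp(theta0)

# Direct evaluation of the defining sum over the (truncated) support:
kl_direct = sum(
    poisson_pmf(k, lam0) * math.log(poisson_pmf(k, lam0) / poisson_pmf(k, lam1))
    for k in range(150)
)
```

The truncation at *k* = 150 is harmless here since the pmf of a Poisson with mean exp(0.5) is astronomically small beyond that point.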

*Example 2.25 (Gamma Model)* The second example considers a vector-valued parameter EF. We consider the cumulant function *κ(θ)* = log *Γ(θ₂)* − *θ₂* log*(−θ₁)* for *θ* = *(θ₁, θ₂)*⊤ ∈ *Θ* = *(−∞, 0)* × *(0, ∞)*; this gives us the gamma model, see Sect. 2.1.3. For the KL divergence from model *θ₁* to model *θ₀* we receive

$$\begin{split} D\_{\text{KL}}(f(\cdot;\theta\_{0})||f(\cdot;\theta\_{1})) &= \left(\theta\_{0,2} - \theta\_{1,2}\right) \frac{\Gamma'(\theta\_{0,2})}{\Gamma(\theta\_{0,2})} - \log\left(\frac{\Gamma(\theta\_{0,2})}{\Gamma(\theta\_{1,2})}\right) \\ &+ \theta\_{1,2} \log\left(\frac{-\theta\_{0,1}}{-\theta\_{1,1}}\right) + \theta\_{0,2} \left(\frac{-\theta\_{1,1}}{-\theta\_{0,1}} - 1\right) \geq \ 0. \end{split}$$

Fisher's information matrix is given by

$$\mathcal{I}(\theta) = \nabla\_{\theta}^{2} \kappa(\theta) = \begin{pmatrix} \frac{\theta\_{2}}{(-\theta\_{1})^{2}} & \frac{1}{-\theta\_{1}}\\ \frac{1}{-\theta\_{1}} & \frac{\Gamma''(\theta\_{2})\Gamma(\theta\_{2}) - \Gamma'(\theta\_{2})^{2}}{\Gamma(\theta\_{2})^{2}} \end{pmatrix}.$$


The off-diagonal terms in Fisher's information matrix *I(θ)* are non-zero which means that the two components of the canonical parameter *θ* interact. Choosing a different parametrization *μ* = *θ*2*/(*−*θ*1*)* (dual mean parametrization) and *α* = *θ*<sup>2</sup> we receive diagonal Fisher's information in *(μ, α)*

$$\mathcal{I}(\mu,\alpha) = \begin{pmatrix} \frac{\alpha}{\mu^2} & 0\\ 0 & \frac{\Gamma''(\alpha)\Gamma(\alpha) - \Gamma'(\alpha)^2}{\Gamma(\alpha)^2} - \frac{1}{\alpha} \end{pmatrix} = \begin{pmatrix} \frac{\alpha}{\mu^2} & 0\\ 0 & \Psi'(\alpha) - \frac{1}{\alpha} \end{pmatrix},\tag{2.30}$$

where *Ψ* is the digamma function, see Footnote 2 on page 22. This transformation is obtained by using the corresponding Jacobian matrix for the variable transformation; more details are provided in (3.16) below. In this new representation, the parameters *μ* and *α* are orthogonal; the term *Ψ'(α)* − 1*/α* is further discussed in Remarks 5.26 and Remarks 5.28, below.

Using this second parametrization based on mean *μ* and dispersion 1*/α*, we arrive at the EDF representation of the gamma model. This allows us to calculate the corresponding unit deviance (within the EDF), which in the gamma case is given by

$$\mathfrak{d}(Y,\mu) = 2\left(\frac{Y}{\mu} - 1 + \log\left(\frac{\mu}{Y}\right)\right) \ge 0.$$
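The stated properties of the gamma unit deviance can be probed numerically: it vanishes exactly on the diagonal *y* = *μ*, and by Lemma 2.22 its curvature there equals 2*/V(μ)* = 2*/μ²* for the gamma variance function *V(μ)* = *μ²*. A small sketch (the evaluation points are illustrative):

```python
import math

def gamma_unit_deviance(y, mu):
    # Gamma unit deviance d(y, mu) = 2*(y/mu - 1 + log(mu/y))
    return 2.0 * (y / mu - 1.0 + math.log(mu / y))

mu = 2.0
values = [gamma_unit_deviance(y, mu) for y in (0.5, 1.0, 2.0, 4.0)]

# Second central difference in mu at y = mu approximates 2/V(mu) = 2/mu^2;
# the middle term d(mu, mu) = 0 drops out exactly.
h = 1e-4
curvature = (gamma_unit_deviance(mu, mu + h)
             + gamma_unit_deviance(mu, mu - h)) / h**2
```

The finite-difference curvature reproduces 2*/μ²* = 0.5 up to discretization error.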

*Example 2.26 (Inverse Gaussian Model)* Our final example considers the inverse Gaussian vector-valued parameter EF case. We consider the cumulant function *κ(θ)* = −2*(θ₁θ₂)*^{1/2} − ½ log*(−2θ₂)* for *θ* = *(θ₁, θ₂)*⊤ ∈ *Θ* = *(−∞, 0]* × *(−∞, 0)*, see Sect. 2.1.3. For the KL divergence from model *θ₁* to model *θ₀* we receive

$$\begin{split} D\_{\mathrm{KL}}(f(\cdot;\theta\_{0})||f(\cdot;\theta\_{1})) &= -\theta\_{1,1}\sqrt{\frac{-\theta\_{0,2}}{-\theta\_{0,1}}} - \theta\_{1,2}\sqrt{\frac{-\theta\_{0,1}}{-\theta\_{0,2}}} - 2\sqrt{\theta\_{1,1}\theta\_{1,2}} \\ &+ \frac{\theta\_{0,2} - \theta\_{1,2}}{-2\theta\_{0,2}} + \frac{1}{2}\log\left(\frac{-\theta\_{0,2}}{-\theta\_{1,2}}\right) \geq \ 0. \end{split}$$

Fisher's information matrix is given by

$$\mathcal{I}(\theta) = \nabla\_{\theta}^2 \kappa(\theta) = \begin{pmatrix} \frac{(-2\theta\_2)^{1/2}}{(-2\theta\_1)^{3/2}} & -\frac{1}{2(\theta\_1\theta\_2)^{1/2}}\\ -\frac{1}{2(\theta\_1\theta\_2)^{1/2}} & \frac{(-2\theta\_1)^{1/2}}{(-2\theta\_2)^{3/2}} + \frac{2}{(-2\theta\_2)^2} \end{pmatrix}.$$

Again the off-diagonal terms in Fisher's information matrix *I(θ)* are non-zero in the canonical parametrization. We switch to the mean parametrization by setting *μ* = *((−2θ₂)/(−2θ₁))*^{1/2} and *α* = −2*θ₂*. This provides us with the diagonal Fisher's information

$$\mathcal{I}(\mu,\alpha) = \begin{pmatrix} \frac{\alpha}{\mu^{3}} & 0\\ 0 & \frac{1}{2\alpha^{2}} \end{pmatrix}. \tag{2.31}$$

This transformation is again obtained by using the corresponding Jacobian matrix for the variable transformation, see (3.16), below. We compare the lower-right entries of (2.30) and (2.31). Note that we have the first order approximation of the digamma function

$$
\Psi(\alpha) \approx \log \alpha - \frac{1}{2\alpha},
$$

and taking derivatives gives *Ψ'(α)* ≈ 1*/α* + 1*/(2α²)*, i.e., *Ψ'(α)* − 1*/α* ≈ 1*/(2α²)*; thus, these entries of Fisher's information are first order equivalent. This is also used in the saddlepoint approximation in Sect. 5.5.2, below. Using this second parametrization based on the mean *μ* and the dispersion 1*/α*, we arrive at the EDF representation of the inverse Gaussian model with unit deviance

$$\mathfrak{d}(Y,\mu) = \frac{(Y-\mu)^2}{\mu^2 Y} \ge 0.$$
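The first order equivalence of the dispersion entries in (2.30) and (2.31) noted above can be checked numerically. The sketch below approximates the trigamma function *Ψ'* by a second central difference of log *Γ* (available as `math.lgamma`); the evaluation points are illustrative:

```python
import math

def trigamma(a, h=1e-2):
    # Psi'(a) approximated by the second central difference of log(Gamma(a)).
    return (math.lgamma(a + h) - 2.0 * math.lgamma(a) + math.lgamma(a - h)) / h**2

# Lower-right entry of (2.30), Psi'(alpha) - 1/alpha, versus its
# counterpart 1/(2*alpha^2) in (2.31), for increasing alpha:
comparison = [(a, trigamma(a) - 1.0 / a, 1.0 / (2.0 * a * a))
              for a in (5.0, 20.0, 100.0)]
```

The ratio of the two entries tends to 1 as *α* grows, in line with the first order expansion above.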

More examples will be given in Chap. 4, below.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


# **Chapter 3 Estimation Theory**

This chapter gives an introduction to decision and estimation theory. This introduction is based on the books of Lehmann [243, 244], the lecture notes of Künsch [229] and the book of Van der Vaart [363]. This chapter presents classical statistical estimation theory, it embeds estimation into a historical context, and it provides important aspects and intuition for modern data science and predictive modeling. For further reading we recommend the books of Barndorff-Nielsen [23], Berger [31], Bickel–Doksum [33] and Efron–Hastie [117].

## **3.1 Introduction to Decision Theory**

We start from an observation vector *Y* = *(Y₁, ..., Yₙ)* taking values in a measurable space *Y* ⊂ ℝ*ⁿ*, where *n* ∈ ℕ denotes the number of components *Yᵢ*, 1 ≤ *i* ≤ *n*, in *Y*. Assume that this observation vector *Y* has been generated by a distribution belonging to the family *P* = {*P(·; θ)*; *θ* ∈ *Θ*} being parametrized by a parameter set *Θ*.

*Remarks 3.1* There are some subtle points in the notation that we are going to use. We use *P(·; θ)* for the distribution of the observation vector *Y*, and if we consider a specific component *Yᵢ* of *Y* we will use the notation *Yᵢ* ∼ *F(·; θ)*. We make this distinction as in estimation theory one often considers i.i.d. observations *Yᵢ* ∼ *F(·; θ)*, 1 ≤ *i* ≤ *n*, with (in this case) joint product distribution *Y* ∼ *P(·; θ)*. This latter distribution is then used for purposes of maximum likelihood estimation, etc. The family *P* is parametrized by *θ* ∈ *Θ*, and if we want to emphasize that this parameter is a *k*-dimensional vector we use the boldface notation *θ*; this is similar to the EFs introduced in Chap. 2, but in this chapter we do not restrict to EFs. Finally, we assume identifiability, meaning that different parameters *θ* give different distributions *P(·; θ)* ∈ *P*.

To fix ideas, assume we want to determine *γ(θ)* for a given functional *γ(·)* on *Θ*. Typically, the true value *θ* ∈ *Θ* is not known, and we are not able to determine *γ(θ)* explicitly. Therefore, we try to *estimate γ(θ)* from data *Y* ∼ *P(·; θ)* that belongs to the same *θ* ∈ *Θ*. As an example we may think of working in the EDF of Chap. 2, and we are interested in the mean *μ* = E*θ*[*Y*] = *κ'(θ)* of *Y*. Thus, we aim at determining *γ(θ)* = *κ'(θ)*. If the true *θ* is unknown, and if we have an observation *Y* from this model, we can try to estimate *γ(θ)* = *κ'(θ)* from *Y*. This motivation is based on estimation of *γ(θ)*, but the following framework of decision making is more general; for instance, it may also be used for statistical hypothesis testing.

Denote the *action space* of possible decisions (actions) by A. In decision theory we are looking for a *decision rule* (*action rule*)

$$A: \mathbb{Y} \to \mathbb{A}, \qquad \mathbf{Y} \mapsto A(\mathbf{Y}), \tag{3.1}$$

which should be understood as an educated guess for *γ (θ )* based on observation *Y*. A decision rule is evaluated in terms of a (given) *loss function*

$$L: \Theta \times \mathbb{A} \to \mathbb{R}\_+, \qquad (\theta, a) \mapsto L(\theta, a) \ge 0. \tag{3.2}$$

*L(θ, a)* describes the loss of an action *a* ∈ A w.r.t. a true parameter choice *θ* ∈ *Θ*. The *risk function* of decision rule *A* for data generated by *Y* ∼ *P(·; θ)* is defined by

$$\theta \mapsto \mathcal{R}(\theta, A) = \mathbb{E}\_{\theta} [L(\theta, A(\mathbf{Y}))] = \int\_{\mathcal{Y}} L\left(\theta, A(\mathbf{y})\right) dP(\mathbf{y}; \theta), \tag{3.3}$$

where <sup>E</sup>*<sup>θ</sup>* is the expectation w.r.t. the probability distribution *P (*·; *θ )*. Risk function (3.3) describes the long-term average loss of using decision rule *A*. As an example we may think of estimating *γ (θ )* for unknown (true) parameter *θ* by a decision rule *Y* → *A(Y)*. Then, the loss function *L(θ , A(Y))* should describe the *estimation loss* if we consider the discrepancy between *γ (θ )* and its estimate *A(Y)*, and the risk function *R(θ , A)* is the *average estimation loss* in that case.

Good decision rules *A* should provide a small risk *R(θ, A)*. Unfortunately, this statement is of a rather theoretical nature because, in general, the true data generating parameter *θ* is not known and the goodness of a decision rule for the true parameter cannot be evaluated explicitly; the risk can only be estimated (for instance, using a bootstrap approach). Moreover, typically, there does not exist a uniformly best decision rule *A* over all *θ* ∈ *Θ*. For these reasons we may (just) try to eliminate decision rules that are obviously not good. We give two introductory examples.

*Example 3.2 (Minimax Decision Rule)* Decision rule *A* is called minimax if for all alternative decision rules *Ã* : Y → A we have

$$\sup\_{\theta \in \Theta} \mathcal{R}(\theta, A) \le \sup\_{\theta \in \Theta} \mathcal{R}(\theta, \widetilde{A}).$$

A minimax decision rule is the best choice in the worst case of the true *θ*, i.e., it minimizes the worst case risk.

*Example 3.3 (Bayesian Decision Rule)* Assume we are given a distribution *π* on . Decision rule *A* is called Bayesian w.r.t. *π* if it satisfies

$$A := \operatorname\*{arg\,min}\_{\tilde{A}} \int\_{\Theta} \mathcal{R}(\theta, \tilde{A}) d\pi(\theta).$$

Distribution *π* is called the *prior distribution* on *Θ*.
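The Bayesian decision rule can be computed explicitly in a toy model. The sketch below (all choices are illustrative, not from the text) takes *Θ* = {0.3, 0.7}, a single Bernoulli observation *Y*, the squared error loss *L(θ, a)* = *(θ − a)²*, the uniform prior *π*, and searches over all decision rules mapping {0, 1} → *Θ*:

```python
from itertools import product

thetas = (0.3, 0.7)                 # parameter set Theta
prior = {0.3: 0.5, 0.7: 0.5}        # uniform prior pi on Theta

def risk(theta, rule):
    # R(theta, A) = E_theta[(theta - A(Y))^2] for Y ~ Bernoulli(theta);
    # 'rule' is the pair (A(0), A(1)).
    return (1.0 - theta) * (theta - rule[0])**2 + theta * (theta - rule[1])**2

def integrated_risk(rule):
    # Integrated risk int_Theta R(theta, rule) d pi(theta).
    return sum(prior[t] * risk(t, rule) for t in thetas)

rules = list(product(thetas, repeat=2))       # all maps {0,1} -> Theta
bayes_rule = min(rules, key=integrated_risk)  # Bayesian decision rule w.r.t. pi
```

The minimizer estimates 0.3 after observing *Y* = 0 and 0.7 after *Y* = 1; with an unrestricted action space and squared error loss, the Bayesian rule would be the posterior mean instead.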

The above examples give two possible choices of decision rules. The first one tries to minimize the worst case risk, whereas the second one uses additional knowledge in terms of a prior distribution *π* on . This means that we impose stronger assumptions in the second case to get stronger conclusions. The difficult part in practice is to justify these stronger assumptions in order to validate the stronger conclusions. Below, we are going to introduce other criteria that should be satisfied by good decision rules, an important one in estimation will be unbiasedness.

## **3.2 Parameter Estimation**

This section focuses on estimating the (unknown) parameter *θ* ∈ *Θ* from an observation *Y* ∼ *P(·; θ)*. For this we consider decision rules *A* : Y → A = *Θ* with *A(Y)* estimating *θ*. We assume there exist densities *p(·; θ)* w.r.t. a fixed *σ*-finite measure *ν* on Y ⊂ ℝ*ⁿ*,

$$dP(\mathbf{y}; \theta) = p(\mathbf{y}; \theta)d\nu(\mathbf{y}),$$

for all distributions *P(·; θ)* ∈ *P*, i.e., for all *θ* ∈ *Θ*.

**Definition 3.4 (Maximum Likelihood Estimator, MLE)** The maximum likelihood estimator (MLE) of *<sup>θ</sup>* for a given observation *<sup>Y</sup>* <sup>∈</sup> <sup>Y</sup> is given by (subject to existence and uniqueness)

$$
\widehat{\theta}^{\mathsf{MLE}} = \operatorname\*{arg\,max}\_{\widetilde{\theta} \in \Theta} p(\boldsymbol{Y}; \widetilde{\boldsymbol{\theta}}) = \operatorname\*{arg\,max}\_{\widetilde{\theta} \in \Theta} \ell\_{\boldsymbol{Y}}(\widetilde{\boldsymbol{\theta}}),
$$

where the log-likelihood function of *p(Y; θ)* is defined by *θ* → *ℓY(θ)* = log *p(Y; θ)*.

The MLE *Y* → *θ̂*MLE = *θ̂*MLE*(Y)* = *A(Y)* is nothing else than a specific decision rule with action space A = *Θ* for estimating *θ*. We can now start to explore the risk function *R(θ, θ̂*MLE*)* of that decision rule for a given loss function *L*.

*Example 3.5 (MLE within the EDF)* We emphasize that this example is used throughout these notes. Assume that the (independent) components of *Y* = *(Y*1*,...,Yn)*- ∼ *P (*·; *θ )* follow a given EDF distribution. That is, we assume that *Y*1*,...,Yn* are independent and have densities w.r.t. *σ*-finite measures on R given by, see (2.14),

$$Y\_i \sim f(y\_i; \theta, v\_i/\varphi) = \exp\left\{\frac{y\_i \theta - \kappa(\theta)}{\varphi/v\_i} + a(y\_i; v\_i/\varphi)\right\},$$

for 1 ≤ *i* ≤ *n*. Note that these random variables are not i.i.d. because they may differ in exposures *vi >* 0. Throughout, we assume that Assumption 2.6 is fulfilled and that the cumulant function *κ* is steep, see Theorem 2.19. For the latter we also refer to Remark 2.20: the supports T*vi/ϕ* of *Yi* may differ; however, these supports share the same convex closure.

Independence between the *Yi*'s implies that the joint probability *P (*·; *θ )* is the product distribution of the individual distributions *F (*·; *θ,vi/ϕ)*, 1 ≤ *i* ≤ *n*. Therefore, the MLE of *θ* in the EDF is found by solving

$$\widehat{\theta}^{\text{MLE}} = \operatorname\*{arg\,max}\_{\widetilde{\theta} \in \Theta} \ell\_Y(\widetilde{\theta}) = \operatorname\*{arg\,max}\_{\widetilde{\theta} \in \Theta} \sum\_{i=1}^n \frac{Y\_i \widetilde{\theta} - \kappa(\widetilde{\theta})}{\varphi/v\_i}.$$

Since the cumulant function *κ* is strictly convex we receive the MLE (subject to existence)

$$\widehat{\theta}^{\text{MLE}} = \widehat{\theta}^{\text{MLE}}(Y) = (\kappa')^{-1} \left( \frac{\sum\_{i=1}^{n} v\_i Y\_i}{\sum\_{i=1}^{n} v\_i} \right) = h \left( \frac{\sum\_{i=1}^{n} v\_i Y\_i}{\sum\_{i=1}^{n} v\_i} \right).$$

Thus, the MLE is received by applying the canonical link *h* = *(κ )*−1, see Definition 2.8, and strict convexity of *κ* implies that the MLE is unique. However, existence needs to be analyzed more carefully! It may happen that the MLE *<sup>θ</sup>*MLE is a boundary point of the effective domain which may not exist (if is open). We give an example. Assume we work in the Poisson model presented in Sect. 2.1.2. The canonical link in the Poisson model is the log-link *μ* → *h(μ)* = log*(μ)*, for *μ >* 0. With positive probability we have in the Poisson case *<sup>n</sup> <sup>i</sup>*=<sup>1</sup> *viYi* <sup>=</sup> 0. Therefore, with positive probability the MLE *<sup>θ</sup>*MLE does not exist (we have a degenerate Poisson model in that case).

Since the canonical link is strictly increasing we can also perform MLE in the dual (mean) parametrization. The dual parameter space is given by *M* = *κ (*˚*)*, see Remarks 2.9, with mean parameters *μ* = *κ (θ )* ∈ *M*. This motivates

$$\widehat{\mu}^{\text{MLE}} = \underset{\widetilde{\mu} \in \mathcal{M}}{\text{arg}\max} \,\ell\_Y(h(\widetilde{\mu})) \, = \underset{\widetilde{\mu} \in \mathcal{M}}{\text{arg}\max} \,\sum\_{l=1}^n \frac{Y\_l h(\widetilde{\mu}) - \kappa(h(\widetilde{\mu}))}{\varphi/v\_l}. \tag{3.4}$$

Subject to existence, this provides the unique MLE

$$
\widehat{\mu}^{\text{MLE}} = \widehat{\mu}^{\text{MLE}}(Y) = \frac{\sum\_{l=1}^{n} v\_l Y\_l}{\sum\_{l=1}^{n} v\_l}. \tag{3.5}
$$

Also this dual MLE does not need to exist (in the dual parameter space *M*). Under the assumption that the cumulant function *κ* is steep, we know that the closure of the dual parameter space *<sup>M</sup>* contains the supports <sup>T</sup>*vi/ϕ* of *Yi*, see Theorem 2.19 and Remark 2.20. Thus, in that case we can close the dual parameter space and receive MLE *<sup>μ</sup>*MLE <sup>∈</sup> *<sup>M</sup>* (in a possibly degenerate model). In the aforementioned degenerate Poisson situation we receive *<sup>μ</sup>*MLE <sup>=</sup> <sup>0</sup> which is in the boundary *<sup>∂</sup><sup>M</sup>* of the dual parameter space. -

**Definition 3.6 (Bayesian Estimator)** The Bayesian estimator of *θ* for a given observation *<sup>Y</sup>* <sup>∈</sup> <sup>Y</sup> and a given prior distribution *<sup>π</sup>* on is given by (subject to existence)

$$
\widehat{\theta}^{\text{Bayes}} = \widehat{\theta}^{\text{Bayes}}(Y) = \mathbb{E}\_{\mathbb{Z}}[\theta | Y],
$$

where the conditional expectation on the right-hand side is calculated under the posterior distribution *π(θ*|*y)* ∝ *p(y*; *θ )π(θ )* for a given observation *Y* = *y*.

*Example 3.7 (Bayesian Estimator)* Assume that <sup>A</sup> <sup>=</sup> <sup>=</sup> <sup>R</sup> and choose the square loss function *L(θ , a)* <sup>=</sup> *(θ* <sup>−</sup> *a)*2. Assume that for *<sup>ν</sup>*-a.e. *<sup>y</sup>* <sup>∈</sup> <sup>Y</sup> the following decision rule *<sup>A</sup>* : <sup>Y</sup> <sup>→</sup> <sup>A</sup> exists

$$A(\mathbf{y}) = \operatorname\*{arg\,min}\_{a \in \mathbb{A}} \mathbb{E}\_{\pi} [(\theta - a)^2 | \mathbf{Y} = \mathbf{y}],\tag{3.6}$$

where the expectation is calculated w.r.t. the posterior distribution *π(θ*|*y)*. In this case, *<sup>A</sup>* is a Bayesian decision rule w.r.t. *<sup>π</sup>* and *L(θ , a)* <sup>=</sup> *(θ* <sup>−</sup> *a)*2: by assumption (3.6) we have for any other decision rule *A* \* : <sup>Y</sup> <sup>→</sup> <sup>A</sup>, *<sup>ν</sup>*-a.s.,

$$\mathbb{E}\_{\pi} \left[ (\theta - A(Y))^2 | Y = \mathbf{y} \right] \le \mathbb{E}\_{\pi} \| (\theta - \widetilde{A}(Y))^2 | Y = \mathbf{y} \|.$$

Applying the tower property we receive for any other decision rule *A* \*

$$\int\_{\Theta} \mathcal{R}(\theta, A) d\pi(\theta) = \mathbb{E}[\left(\theta - A(Y)\right)^2] \le \mathbb{E}[\left(\theta - \tilde{A}(Y)\right)^2] = \int\_{\Theta} \mathcal{R}(\theta, \tilde{A}) d\pi(\theta),$$

where the expectation E is calculated over the joint distribution of *Y* and *θ*. This proves that *<sup>A</sup>* is a Bayesian decision rule w.r.t. *<sup>π</sup>* and *L(θ , a)* <sup>=</sup> *(θ* <sup>−</sup> *a)*2, see Example 3.3. Finally, note that the conditional expectation given in Definition 3.6 is the minimizer of (3.6). This justifies the name Bayesian estimator in Definition 3.6 (for the square loss function). The case of the Bayesian estimator for a general loss function *L* is considered in Theorem 4.1.1 of Lehmann [244]. -

**Definition 3.8 (Method of Moments Estimator)** Assume that <sup>⊆</sup> <sup>R</sup>*<sup>k</sup>* and that the components *Yi* of *Y* are i.i.d. *F (*·; *θ)* distributed with finite *k*-th moments for all *θ* ∈ . The law of large numbers provides, a.s., for all 1 ≤ *l* ≤ *k*,

$$\lim\_{n \to \infty} \frac{1}{n} \sum\_{l=1}^n Y\_l^l = \mathbb{E}\_{\theta} [Y\_1^l].$$

Assume that the following map is invertible (on suitable range definitions for (3.7)– (3.8))

$$\gamma: \Theta \to \mathbb{R}^k, \qquad \theta \mapsto \gamma(\theta) = \left(\mathbb{E}\_{\theta}[Y\_1], \dots, \mathbb{E}\_{\theta}[Y\_1^k]\right)^\top. \tag{3.7}$$

The method of moments estimator of *θ* is defined by

$$\widehat{\boldsymbol{\theta}}^{\text{MM}} = \widehat{\boldsymbol{\theta}}^{\text{MM}}(\boldsymbol{Y}) = \boldsymbol{\gamma}^{-1} \left( \frac{1}{n} \sum\_{i=1}^{n} \boldsymbol{Y}\_{i}, \dots, \frac{1}{n} \sum\_{i=1}^{n} \boldsymbol{Y}\_{i}^{k} \right)^{\top}. \tag{3.8}$$

The MLE, the Bayesian estimator and the method of moments estimator are the most commonly used parameter estimators. They may have additional properties (under certain assumptions) that we are going to explore below. In the remainder of this section we give an additional view on estimators which is based on the empirical distribution of the observation *Y*.

#### 3.2 Parameter Estimation 55

Assume that the components *Yi* of *Y* are real-valued and i.i.d. *F* distributed. The empirical distribution induced by the observation *Y* = *(Y*1*,...,Yn)*is given by

$$\widehat{F}\_n(\mathbf{y}) = \frac{1}{n} \sum\_{l=1}^n \mathbb{1}\_{\{Y\_l \le \mathbf{y}\}} \qquad \text{for } \mathbf{y} \in \mathbb{R}, \tag{3.9}$$

we also refer to Fig. 1.2 (lhs). The Glivenko–Cantelli theorem [64, 159] tells us that the empirical distribution *F <sup>n</sup>* converges uniformly to *<sup>F</sup>*, a.s., for *<sup>n</sup>* → ∞.

**Definition 3.9 (Fisher-Consistency)** Denote by P the set of all distribution functions on the given probability space. Let *<sup>Q</sup>* : <sup>P</sup> <sup>→</sup> be a functional with the property

$$\mathcal{Q}(F(\cdot;\theta)) = \theta \qquad \text{for all } F(\cdot;\theta) \in \mathcal{F} = \{F(\cdot;\theta); \theta \in \Theta\} \subset \mathfrak{P}.$$

Such a functional is called *Fisher-consistent* for *F* and *θ* ∈ , respectively.

A given Fisher-consistent functional *<sup>Q</sup>* motivates the estimator *<sup>θ</sup>* <sup>=</sup> *Q(F n)* <sup>∈</sup> . This is exactly what we have applied for the method of moments estimator (3.8) with Fisher-consistent functional induced by the inverse of (3.7). The next example shows that this also works for MLE.

*Example 3.10 (MLE and Kullback–Leibler (KL) Divergence)* The MLE can be received from a Fisher-consistent functional. Consider for *<sup>F</sup>* <sup>∈</sup> <sup>P</sup> the functional

$$\mathcal{Q}(F) = \arg\max\_{\widetilde{\boldsymbol{\theta}}} \int \log f(\mathbf{y}; \widetilde{\boldsymbol{\theta}}) dF(\mathbf{y}),$$

assuming that *f (*·;\**θ)* are densities w.r.t. a *<sup>σ</sup>*-finite measure on <sup>R</sup>. Assume that *<sup>F</sup>* has density *f* w.r.t. the *σ*-finite measure *ν* on R. Then, we can rewrite the above as

$$\mathcal{Q}(F) = \operatorname\*{arg\,min}\_{\tilde{\theta}} \int \log \left( \frac{f(\mathbf{y})}{f(\mathbf{y}; \tilde{\theta})} \right) f(\mathbf{y}) d\nu(\mathbf{y}) = \operatorname\*{arg\,min}\_{\tilde{\theta}} D\_{\text{KL}}(f||f(\cdot; \tilde{\theta})).$$

The latter is the Kullback–Leibler (KL) divergence which we have met in Sect. 2.3. Lemma 2.21 states that the KL divergence is non-negative, and it is zero if and only if the two densities *<sup>f</sup>* and *f (*·;\**θ)* are identical, *<sup>ν</sup>*-a.s. This implies that *Q(F (*·; *θ ))* <sup>=</sup> *θ*. Thus, *Q* is Fisher-consistent for *θ* ∈ , assuming identifiability, see Remarks 3.1.

Next, we use this Fisher-consistent functional (KL divergence) to receive the MLE. Replace the unknown distribution *F* by the empirical one to receive

$$\begin{aligned} \mathcal{Q}(\widehat{F}\_n) &= \underset{\widetilde{\theta}}{\text{arg min}} \; D\_{\text{KL}}(\widehat{f}\_n || f(\cdot; \widetilde{\theta})) \\ &= \underset{\widetilde{\theta}}{\text{arg}\max} \; \frac{1}{n} \sum\_{i=1}^n \log f(Y\_i; \widetilde{\theta}) \; = \,\,\widetilde{\theta}^{\text{MLE}}, \end{aligned}$$

where we have used that the empirical density *f <sup>n</sup>* allocates point masses of size 1*/n* to the i.i.d. observations *<sup>Y</sup>*1*,...,Yn*. Thus, the MLE *<sup>θ</sup>*MLE of *<sup>θ</sup>* can be obtained by choosing the model *f (*·;\**θ)*, \**<sup>θ</sup>* <sup>∈</sup> , that is closest in KL divergence to the empirical distribution *F <sup>n</sup>* of i.i.d. observations *Yi* <sup>∼</sup> *<sup>F</sup>*. Note that in this construction we do not assume that the true distribution *F* is in *F*, see Definition 3.9. -

*Remarks 3.11*


## **3.3 Unbiased Estimators**

We introduce the property of uniformly minimum variance unbiased (UMVU) for decision rules in this section. This is a very attractive property in insurance pricing because it gives a quality statement to decision rules (and to the resulting prices). At the current stage it is not clear how unbiasedness is related, e.g., to the MLE of *θ*.

## *3.3.1 Cramér–Rao Information Bound*

Above we have stated some quality criteria for decision rules like the minimax property. A crucial property in financial applications is the so-called *unbiasedness* (for mean estimates) because this guarantees that the overall (price) levels are correctly specified.

**Definition 3.12 (Uniformly Minimum Variance Unbiased, UMVU)** A decision rule *<sup>A</sup>* : <sup>Y</sup> <sup>→</sup> <sup>A</sup> <sup>=</sup> <sup>R</sup> is unbiased for *<sup>γ</sup>* : <sup>→</sup> <sup>R</sup> if for all *Y* ∼ *P (*·; *θ )*, *θ* ∈ , we have

$$\mathbb{E}\_{\theta}[A(Y)] = \chi(\theta). \tag{3.10}$$

The decision rule *A* is called UMVU for *γ* if additionally to the unbiasedness (3.10) we have

$$\text{Var}\_{\theta}(A(Y)) \le \text{Var}\_{\theta}(\check{A}(Y)),$$

for all *θ* ∈ and for any other decision rule *A* \* : <sup>Y</sup> <sup>→</sup> <sup>R</sup> that is unbiased for *γ* .

Note that unbiasedness is not invariant under transformations, i.e., if *A(Y)* is unbiased for *γ (θ )*, then, in general, *b(A(Y))* is not unbiased for *b(γ (θ ))*. For instance, if *b* is strictly convex then we get a counterexample by simply applying Jensen's inequality.

Our first step is to derive a general lower bound for Var*<sup>θ</sup> (A(Y))*. If this general lower bound is met for an unbiased decision rule *A* for *γ* , then we know that it is UMVU for *γ* . We start with the one-dimensional case given in Section 2.6 of Lehmann [244].

**Theorem 3.13 (Cramér–Rao Information Bound)** *Assume that the distributions P (*·; *θ ), θ* ∈ *, have densities p(*·; *θ ) for a given σ-finite measure ν on* <sup>Y</sup>*, and that* <sup>⊂</sup> <sup>R</sup> *is an open interval such that the set* {*y*; *p(y*; *θ) >* <sup>0</sup>} *does not depend on <sup>θ</sup>* <sup>∈</sup> *. Let A(Y) be unbiased for <sup>γ</sup>* : <sup>→</sup> <sup>R</sup> *having finite second moment. If the limit*

$$\frac{\partial}{\partial \theta} \log p(\mathbf{y}; \theta) = \lim\_{\Delta \to 0} \frac{1}{\Delta} \frac{p(\mathbf{y}; \theta + \Delta) - p(\mathbf{y}; \theta)}{p(\mathbf{y}; \theta)}$$

*exists in <sup>L</sup>*2*(P (*·; *θ )) and if*

$$\mathcal{Z}(\theta) = \mathbb{E}\_{\theta} \left[ \left( \frac{\partial}{\partial \theta} \log p(\mathbf{Y}; \theta) \right)^2 \right] \in (0, \infty),$$

*then the function <sup>θ</sup>* <sup>→</sup> *γ (θ ) is differentiable,* <sup>E</sup>*<sup>θ</sup>* [ *<sup>∂</sup> ∂θ* log *p(Y*; *θ )*] = 0 *and we have information bound*

$$\text{Var}\_{\theta}(A(Y)) \ge \frac{\chi'(\theta)^2}{\mathcal{Z}(\theta)}.$$

*Proof* We start from an arbitrary function *<sup>ψ</sup>* : <sup>×</sup> <sup>Y</sup> <sup>→</sup> <sup>R</sup> with finite variance Var*<sup>θ</sup> (ψ(θ , Y ))* ∈ *(*0*,*∞*)* for all *θ* ∈ . The Cauchy–Schwarz inequality implies

$$\text{Var}\_{\theta}(A(\mathbf{Y})) \ge \frac{\text{Cov}\_{\theta}(A(\mathbf{Y}), \psi(\theta, \mathbf{Y}))^2}{\text{Var}\_{\theta}(\psi(\theta, \mathbf{Y}))}.\tag{3.11}$$

If we manage to make the right-hand side of (3.11) independent of decision rule *A(*·*)* we have a general lower bound, we also refer to Theorem 2.6.1 in Lehmann [244].

The Cauchy–Schwarz inequality implies that for any *<sup>U</sup>* <sup>∈</sup> *<sup>L</sup>*2*(P (*·; *θ ))* the following limit exists and is equal to

$$\lim\_{\Delta \to 0} \mathbb{E}\_{\theta} \left[ \frac{1}{\Delta} \frac{p(\mathbf{Y}; \theta + \Delta) - p(\mathbf{Y}; \theta)}{p(\mathbf{Y}; \theta)} U \right] = \mathbb{E}\_{\theta} \left[ \frac{\partial}{\partial \theta} \log p(\mathbf{Y}; \theta) U \right]. \tag{3.12}$$

Setting *<sup>U</sup>* <sup>≡</sup> 1 gives average score <sup>E</sup>*<sup>θ</sup>* [ *<sup>∂</sup> ∂θ* log *p(Y*; *θ )*] = 0 because for sufficiently small

$$\mathbb{E}\_{\theta} \left[ \frac{p(\mathbf{Y}; \theta + \Delta) - p(\mathbf{Y}; \theta)}{p(\mathbf{Y}; \theta)} \right] = \int\_{\mathcal{Y}} \frac{p(\mathbf{y}; \theta + \Delta) - p(\mathbf{y}; \theta)}{p(\mathbf{y}; \theta)} p(\mathbf{y}; \theta) d\nu(\mathbf{y}) = 0,$$

where we have used that the support of the random variables does not depend on *θ* and that the domain of *θ* is open.

Secondly, we set *U* = *A(Y)* in (3.12). We have similarly to above using unbiasedness w.r.t. *γ*

$$\text{Cov}\_{\theta}\left(A(\mathbf{Y}), \frac{p(\mathbf{Y}; \theta + \Delta) - p(\mathbf{Y}; \theta)}{p(\mathbf{Y}; \theta)}\right) = \int\_{\mathcal{Y}} A(\mathbf{y}) \frac{p(\mathbf{y}; \theta + \Delta) - p(\mathbf{y}; \theta)}{p(\mathbf{y}; \theta)} p(\mathbf{y}; \theta) d\nu(\mathbf{y}),$$

$$= \chi(\theta + \Delta) - \chi(\theta).$$

Existence of limit (3.12) provides the differentiability of *γ* . Finally, from (3.11) we have

$$\text{Var}\_{\theta}(A(Y)) \ge \lim\_{\Delta \to 0} \frac{\text{Cov}\_{\theta}\left(A(Y), \frac{p(\mathbf{Y}; \theta + \Delta) - p(\mathbf{Y}; \theta)}{p(\mathbf{Y}; \theta)}\right)^{2}}{\text{Var}\_{\theta}\left(\frac{p(\mathbf{Y}; \theta + \Delta) - p(\mathbf{Y}; \theta)}{p(\mathbf{Y}; \theta)}\right)} = \frac{\mathbf{y}^{\prime}(\theta)^{2}}{\mathcal{Z}(\theta)}.\tag{3.13}$$

This completes the proof.

*Remarks 3.14 (Fisher's Information and Score)*


$$\mathcal{L}(\theta) = \mathbb{E}\_{\theta} \left[ \left( \frac{\partial}{\partial \theta} \log p(\mathbf{Y}; \theta) \right)^{2} \right] = -\mathbb{E}\_{\theta} \left[ \frac{\partial^{2}}{\partial \theta^{2}} \log p(\mathbf{Y}; \theta) \right]. \tag{3.14}$$

Fisher's information *I(θ )* expresses the variance of the score *s(θ , Y)*. Identity (3.14) justifies the notion Fisher's information in Sect. 2.3 for the EF.

• In order to determine the Cramér–Rao information bound for unknown *θ* we need to estimate Fisher's information *I(θ )* from the available data. There are two different ways to do so, either we choose

$$\mathcal{I}(\widehat{\theta}) = \mathbb{E}\_{\widehat{\theta}} \left[ \left( \frac{\partial}{\partial \theta} \log p(Y; \theta) \right)^2 \right],$$

or we choose the *observed Fisher's information*

$$
\widehat{\mathcal{Z}}(\widehat{\theta}) = \left. \left( \frac{\partial}{\partial \theta} \log p(\mathbf{Y}; \theta) \right)^2 \right|\_{\theta = \widehat{\theta}},
$$

for given data *<sup>Y</sup>* and where *<sup>θ</sup>* <sup>=</sup> *θ (Y)*. Both estimated Fisher's information *<sup>I</sup>( θ )* and *<sup>I</sup>( θ )* play a central role in MLE of generalized linear models (GLMs). They are used in Fisher's scoring method, the iterated re-weighted least squares (IRLS) algorithm and the Newton–Raphson algorithm to determine the MLE.

• The Cramér–Rao information bound in Theorem 3.13 is stated in terms of the observation *Y* ∼ *p(*·; *θ )*. Assume that the components *Yi* of *Y* are i.i.d. *f (*·; *θ )* distributed. In this case, Fisher's information scales as

$$\mathcal{L}(\theta) = \mathcal{Z}\_n(\theta) = n \mathcal{Z}\_l(\theta), \tag{3.15}$$

with single risk's Fisher's information (contribution)

$$\mathcal{I}\_{\mathrm{l}}(\theta) = \mathbb{E}\_{\theta} \left[ \left( \frac{\partial}{\partial \theta} \log f(Y\_{\mathrm{l}}; \theta) \right)^{2} \right].$$

In general, Fisher's information is additive in independent random variables, because the product of densities is additive after applying the logarithm, and because the average score is zero.

**Proposition 3.15** *The unbiased decision rule A for γ attains the Cramér– Rao information bound if and only if the density is of the form p(y*; *θ )* = exp {*δ(θ )T (y)* − *β(θ )* + *a(y)*} *with T* = *A. In that case we have γ (θ )* = *β (θ )/δ (θ ).*

*Proof of Proposition 3.15* The Cauchy–Schwarz inequality provides equality in (3.13) if and only if *<sup>∂</sup> ∂θ* log *p(y*; *θ )* = *δ (θ )A(y)*−*β (θ )*, *ν*-a.s, for some functions *δ (θ )* and *β (θ )* on . Integration and the fact that *p(*·; *θ )* is a density whose support does not depend on the explicit choice of *θ* ∈ provide the implication "⇒". For the implication "⇐" we study for *A* = *T*

$$0 = \mathbb{E}\_{\theta} \left[ \frac{\partial}{\partial \theta} \log p(\mathbf{Y}; \theta) \right] = \int\_{\mathcal{Y}} (\delta'(\theta)A(\mathbf{y}) - \beta'(\theta)) p(\mathbf{y}; \theta) d\nu(\mathbf{y}) = \delta'(\theta) \mathbb{E}\_{\theta}[A(\mathbf{Y})] - \beta'(\theta).$$

In that case we have *γ (θ )* <sup>=</sup> <sup>E</sup>*<sup>θ</sup>* [*A(Y)*] = *<sup>β</sup> (θ )/δ (θ )*. Moreover, we have equality in the Cauchy–Schwarz inequality. This finishes the proof.

The single-parameter EF fulfills the properties of Proposition 3.15 with *δ(θ )* = *θ* and *β(θ )* = *κ(θ )*, and decision rule *A(y)* = *T (y)* attains the Cramér–Rao information bound for *γ (θ )* = *κ (θ )*.

We give a multi-dimensional version of the Cramér–Rao information bound.

#### **Theorem 3.16 (Multi-Dimensional Version of the Cramér–Rao**

**Information Bound, Without Proof)** *Assume that the distributions P (*·; *θ), <sup>θ</sup>* <sup>∈</sup> *, have densities p(*·; *<sup>θ</sup>) for a given <sup>σ</sup>-finite measure <sup>ν</sup> on* <sup>Y</sup>*, and that* <sup>⊆</sup> <sup>R</sup>*<sup>k</sup> is an open convex set such that the set* {*y*; *p(y*; *<sup>θ</sup>) >* <sup>0</sup>} *does not depend on <sup>θ</sup>* <sup>∈</sup> *. Let A(Y) be unbiased for <sup>γ</sup>* : <sup>→</sup> <sup>R</sup> *having finite second moment. Under additional regularity conditions, see Theorem 7.3 in Section 2.7 of Lehmann [244], we have*

$$\text{Var}\_{\theta}(A(\mathbf{Y})) \ge \left(\nabla\_{\theta}\boldsymbol{\gamma}(\theta)\right)^{\top} \mathcal{Z}(\boldsymbol{\theta})^{-1} \nabla\_{\theta}\boldsymbol{\gamma}(\theta),$$

*with (positive definite) Fisher's information matrix I(θ)* = *(Il,j (θ))*1≤*l,j*≤*<sup>k</sup> given by*

$$\mathcal{Z}\_{l,j}(\theta) = \mathbb{E}\_{\theta} \left[ \frac{\partial}{\partial \theta^{l}} \log p(Y; \theta) \frac{\partial}{\partial \theta^{j}} \log p(Y; \theta) \right],$$

*for* 1 ≤ *l, j* ≤ *k.*

#### 3.3 Unbiased Estimators 61

#### *Remarks 3.17*


$$\mathbb{E}(\boldsymbol{\theta}) = \mathbb{E}\_{\boldsymbol{\theta}}\left[ \left( \nabla\_{\boldsymbol{\theta}} \log p(\boldsymbol{Y}; \boldsymbol{\theta}) \right) \left( \nabla\_{\boldsymbol{\theta}} \log p(\boldsymbol{Y}; \boldsymbol{\theta}) \right)^{\top} \right] = -\mathbb{E}\_{\boldsymbol{\theta}}\left[ \nabla\_{\boldsymbol{\theta}}^{2} \log p(\boldsymbol{Y}; \boldsymbol{\theta}) \right] \in \mathbb{R}^{k \times k}.$$

Thus, Fisher's information matrix can either be calculated from a quadratic form of the score *s(θ, <sup>Y</sup>)* = ∇*<sup>θ</sup>* log *p(Y*; *<sup>θ</sup>)* or from the Hessian <sup>∇</sup><sup>2</sup> *<sup>θ</sup>* of the log-likelihood *<sup>Y</sup> (θ)* = log *p(Y*; *θ)*. Since the score has mean zero, Fisher's information matrix is equal to the covariance matrix of the score *s(θ, Y)*.

In many situations we do not work under the canonical parametrization *θ*. Considerations then require a change of variable. Assume that

$$
\mathfrak{L} \in \mathbb{R}^r \mapsto \mathfrak{G} = \mathfrak{G}(\xi) \in \mathbb{R}^k,
$$

such that all derivatives *∂θl(ζ )/∂ζj* exist for 1 ≤ *l* ≤ *k* and 1 ≤ *j* ≤ *r*. The Jacobian matrix is given by

$$J(\boldsymbol{\xi}) = \left(\frac{\partial}{\partial \boldsymbol{\xi}\_j} \theta\_l(\boldsymbol{\xi})\right)\_{1 \le l \le k, \, 1 \le j \le r} \in \mathbb{R}^{k \times r}.$$

Fisher's information matrix w.r.t. *ζ* is given by

$$\mathcal{X}^\*(\boldsymbol{\xi}) = \left( \mathbb{E}\_{\boldsymbol{\theta}(\boldsymbol{\xi})} \left[ \frac{\partial}{\partial \boldsymbol{\xi}\_l} \log p(\mathbf{Y}; \boldsymbol{\theta}(\boldsymbol{\xi})) \frac{\partial}{\partial \boldsymbol{\xi}\_j} \log p(\mathbf{Y}; \boldsymbol{\theta}(\boldsymbol{\xi})) \right] \right)\_{1 \le l, j \le r} \in \mathbb{R}^{r \times r},$$

and we have the identity

$$\mathcal{T}^\*(\xi) = J(\xi)^\top \mathcal{Z}(\theta(\xi)) \, J(\xi). \tag{3.16}$$

This formula is used quite frequently, e.g., in generalized linear models when changing the parametrization of the models.

## *3.3.2 Information Bound in the Exponential Family Case*

The purpose of this section is to summarize the Cramér–Rao information bound results for the EF and the EDF, since these families play a distinguished role in statistical and actuarial modeling.

#### **Cramér–Rao Information Bound in the EF Case**

We start with the EF case. Assume we have i.i.d. observations *Y*1*,...,Yn* having densities w.r.t. a *σ*-finite measure *ν* on R given by the EF, see (2.2),

$$dF(\mathbf{y}; \boldsymbol{\theta}) = f(\mathbf{y}; \boldsymbol{\theta})d\boldsymbol{\nu}(\mathbf{y}) = \exp\left\{\boldsymbol{\theta}^{\top}T(\mathbf{y}) - \kappa(\boldsymbol{\theta}) + a(\mathbf{y})\right\}d\boldsymbol{\nu}(\mathbf{y}),$$

for canonical parameter *<sup>θ</sup>* <sup>∈</sup> <sup>⊆</sup> <sup>R</sup>*k*. We assume to work under a minimal representation implying that the cumulant function *κ* is strictly convex on the interior ˚, see Assumption 2.6. Moreover, we assume that the cumulant function *κ* is steep in the sense of Theorem 2.19. Consider the (aggregated) *statistics* of the joint EF *P* = {*P (*·; *θ )*; *θ* ∈ }

$$\mathbf{y} \mapsto S(\mathbf{y}) \stackrel{\text{def.}}{=} \left( \sum\_{l=1}^{n} T\_{\mathbf{l}}(\mathbf{y}\_{l}), \dots, \sum\_{l=1}^{n} T\_{\mathbf{k}}(\mathbf{y}\_{l}) \right)^{\text{l}} \in \mathbb{R}^{k}. \tag{3.17}$$

We calculate the score of this EF

$$s(\theta, Y) = \nabla\_{\theta} \log p(Y; \theta) = \nabla\_{\theta} \left( \theta^{\top} \sum\_{i=1}^{n} T(Y\_i) - n\kappa(\theta) \right) = S(Y) - n \nabla\_{\theta} \kappa(\theta).$$

An immediate consequence of Corollary 2.5 is that the expected value of the score is zero for any *<sup>θ</sup>* <sup>∈</sup> ˚. This then reads as

$$\mu = \mathbb{E}\_{\theta} \left[ T(Y\_{l}) \right] = \mathbb{E}\_{\theta} \left[ S(Y)/n \right] = \nabla\_{\theta} \kappa(\theta) \; \in \; \mathbb{R}^{k}. \tag{3.18}$$

Thus, the statistics *S(Y)/n* is an unbiased decision rule for the mean *μ* = ∇*<sup>θ</sup> κ(θ)*, and we can study its Cramér–Rao information bound. Fisher's information matrix is given by the positive definite matrix

$$\mathcal{Z}(\boldsymbol{\theta}) = \mathcal{Z}\_{\boldsymbol{\theta}}(\boldsymbol{\theta}) = \mathbb{E}\_{\boldsymbol{\theta}}\left[\boldsymbol{s}(\boldsymbol{\theta}, \boldsymbol{Y})\boldsymbol{s}(\boldsymbol{\theta}, \boldsymbol{Y})^{\top}\right] = -\mathbb{E}\_{\boldsymbol{\theta}}\left[\nabla\_{\boldsymbol{\theta}}^{2}\log p(\boldsymbol{Y}; \boldsymbol{\theta})\right] = n\nabla\_{\boldsymbol{\theta}}^{2}\boldsymbol{\kappa}(\boldsymbol{\theta}) \; \in \; \mathbb{R}^{k \times k}.$$

Note that the multi-dimensionally extended Cramér–Rao information bound in Theorem 3.16 applies to the individual components of vector *μ* = ∇*<sup>θ</sup> κ(θ)* ∈ <sup>R</sup>*k*. Assume we would like to estimate its *<sup>j</sup>* -th component, set *γj (θ)* <sup>=</sup> *μj* <sup>=</sup> *(*∇*<sup>θ</sup> κ(θ))j* = *∂κ(θ)/∂θj* , for 1 ≤ *j* ≤ *k*. This corresponds to the *j* -th component *Sj (Y)* of the statistics *S(Y)*. We have unbiasedness of *Sj (Y)/n* for *γj (θ)* = *μj* = *(*∇*<sup>θ</sup> κ(θ))j* , and this unbiased statistics attains the Cramér–Rao information bound

$$\operatorname{Var}\_{\theta}(S\_{j}(\mathbf{Y})/n) = \frac{1}{n} \left(\nabla\_{\theta}^{2} \kappa(\theta)\right)\_{j,j} = (\nabla\_{\theta} \gamma\_{j}(\theta))^{\top} \mathcal{Z}(\theta)^{-1} (\nabla\_{\theta} \gamma\_{j}(\theta)). \tag{3.19}$$

Recall that *<sup>I</sup>(θ)*−<sup>1</sup> scales as *<sup>n</sup>*−1, see (3.15). This provides us with the following corollary.

**Corollary 3.18** *Assume Y*1*,...,Yn are i.i.d. and follow an EF (under a minimal representation). The components of the statistics S(Y)/n are UMVU for γj (θ)* <sup>=</sup> *∂κ(θ)/∂θj ,* <sup>1</sup> <sup>≤</sup> *<sup>j</sup>* <sup>≤</sup> *<sup>k</sup> and <sup>θ</sup>* <sup>∈</sup> ˚*, with*

$$\text{Var}\_{\theta}\left(\frac{1}{n}S\_{j}(Y)\right) = \frac{1}{n}\frac{\partial^{2}}{\partial\theta\_{j}^{2}}\kappa(\theta).$$

*The corresponding covariance terms are for* 1 ≤ *j,l* ≤ *k given by*

$$\operatorname{Cov}\_{\theta} \left( \frac{1}{n} \mathbf{S}\_{\circ}(\mathbf{Y}), \frac{1}{n} \mathbf{S}\_{l}(\mathbf{Y}) \right) = \frac{1}{n} \frac{\partial^{2}}{\partial \theta\_{\circ} \partial \theta\_{l}} \kappa(\theta).$$

The UMVU property stated in Corollary 3.18 is, in general, not related to MLE, but within the EF there is the following link. We have (subject to existence)

$$\widehat{\boldsymbol{\theta}}^{\text{ML.E}} = \operatorname\*{arg\,max}\_{\widetilde{\boldsymbol{\theta}} \in \Theta} p(\boldsymbol{Y}; \widetilde{\boldsymbol{\theta}}) = \operatorname\*{arg\,max}\_{\widetilde{\boldsymbol{\theta}} \in \Theta} \left( \widetilde{\boldsymbol{\theta}}^{\top} \boldsymbol{S}(\boldsymbol{Y}) - n\kappa(\widetilde{\boldsymbol{\theta}}) \right) = h\left( \frac{1}{n} \boldsymbol{S}(\boldsymbol{Y}) \right), \tag{3.20}$$

where *<sup>h</sup>* <sup>=</sup> *(*∇*<sup>θ</sup> κ)*−<sup>1</sup> is the canonical link of this EF, see Definition 2.8; and where we need to ensure that a solution to (3.20) exists; e.g., the solution to (3.20) might be at the boundary of which may cause problems, see Example 3.5. <sup>1</sup> Because the cumulant function *κ* is strictly convex (in a minimal representation), we receive the

<sup>1</sup> Another example where there does not exist a proper solution to the MLE problem (3.20) is, for instance, obtained within the 2-dimensional Gaussian EF if we have only one single observation *Y*1. Intuitively this is clear because we cannot estimate two parameters from one observation *T (Y*1*)* = *(Y*1*, Y*<sup>2</sup> 1 *)*.

MLE for the mean parameter *<sup>μ</sup>* <sup>=</sup> <sup>E</sup>*<sup>θ</sup>* [*T (Y*1*)*]

$$\widehat{\mu}^{\mathsf{MLE}} = \operatorname\*{arg\,max}\_{\widetilde{\mu} \in \overline{\mathcal{M}}} \left( h(\widetilde{\mu})^\top S(Y) - n\kappa \left( h(\widetilde{\mu}) \right) \right) = \frac{1}{n} S(Y),$$

the dual parameter space *<sup>M</sup>* = ∇*<sup>θ</sup> κ()* <sup>⊆</sup> <sup>R</sup>*<sup>k</sup>* has been introduced in Remarks 2.9. If *S(Y)/n* is contained in*M*, then this MLE is a proper solution; otherwise, because we have assumed that the cumulant function *κ* is steep, the MLE exists in the closure *M*, see Theorem 2.19, and it is UMVU for *μ*, see Corollary 3.18.

**Corollary 3.19 (Balance Property)** *Assume Y*1*,...,Yn are i.i.d. and follow an EF with <sup>θ</sup>* <sup>∈</sup> ˚ *and T (Yi)* <sup>∈</sup> *<sup>M</sup>, a.s. The MLE <sup>μ</sup>*MLE <sup>∈</sup> *<sup>M</sup> is UMVU for μ, and it fulfills the balance property on portfolio level, i.e.,*

$$\sum\_{l=1}^{n} \mathbb{E}\_{\widehat{\mu}^{\mathrm{MLE}}} \left[ T(Y\_l) \right] = n \widehat{\mu}^{\mathrm{MLE}} = S(Y).$$

*Remarks 3.20*

• The balance property is a very important property in insurance pricing because it implies that the portfolio is priced on the right level: we have unbiasedness

$$\mathbb{E}\_{\theta} \left[ \sum\_{l=1}^{n} \mathbb{E}\_{\widehat{\mu}^{\text{MLE}}} \left[ T(Y\_{l}) \right] \right] = \mathbb{E}\_{\theta} \left[ S(Y) \right] = n\mu. \tag{3.21}$$


$$\mathbb{E}\_{\theta}\left[\widehat{\theta}^{\text{MLE}}\right] = \mathbb{E}\_{\theta}\left[h\left(\frac{1}{n}S(Y\_n)\right)\right] \\ \quad < h\left(\mathbb{E}\_{\theta}\left[\frac{1}{n}S(Y\_n)\right]\right) = h\left(\mu\right) = \theta. \tag{3.22}$$

• The statistics *S(Y)* is a sufficient statistics of *Y*, this follows from the factorization criterion; see Theorem 1.5.2 of Lehmann [244].

#### **Cramér–Rao Information Bound in the EDF Case**

The single-parameter linear EDF case is very similar to the above vector-valued parameter EF case. We briefly summarize the main results in the EDF case.

Recall Example 3.5: assume that *Y*1*,...,Yn* are independent having densities w.r.t. a *σ*-finite measures on R (not being concentrated in a single point) given by, see (2.14),

$$Y\_l \sim f(\mathbf{y}\_l; \theta, \mathbf{v}\_l/\varphi) = \exp\left\{ \frac{\mathbf{y}\_l \theta - \kappa(\theta)}{\varphi/v\_l} + a(\mathbf{y}\_l; v\_l/\varphi) \right\},\tag{3.23}$$

for 1 ≤ *i* ≤ *n*. Note that these random variables are not i.i.d. because they may differ in the exposures *vi >* 0. The MLE of *μ* = *κ (θ )*, *<sup>θ</sup>* <sup>∈</sup> ˚, is found by, see (3.5),

$$\widehat{\mu}^{\text{MLE}} = \underset{\widetilde{\mu} \in \overline{\mathcal{M}}}{\text{arg}\max} \sum\_{l=1}^{n} \frac{Y\_l h(\widetilde{\mu}) - \kappa(h(\widetilde{\mu}))}{\varphi/v\_l} = \frac{\sum\_{l=1}^{n} v\_l Y\_l}{\sum\_{l=1}^{n} v\_l},\tag{3.24}$$

we assume that *<sup>κ</sup>* is steep to ensure *<sup>μ</sup>*MLE <sup>∈</sup> *<sup>M</sup>*. The convolution formula of Corollary 2.15 says that the MLE *<sup>μ</sup>*MLE <sup>=</sup> *<sup>Y</sup>*<sup>+</sup> belongs to the same EDF with the same canonical parameter *θ* and the same dispersion *ϕ*, only the weight changes to *<sup>v</sup>*<sup>+</sup> <sup>=</sup> *<sup>n</sup> <sup>i</sup>*=<sup>1</sup> *vi*.

**Corollary 3.21 (Balance Property)** *Assume Y*1*,...,Yn are independent with EDF distribution* (3.23) *for <sup>θ</sup>* <sup>∈</sup> ˚ *and Yi* <sup>∈</sup> *<sup>M</sup>, a.s. The MLE <sup>μ</sup>*MLE <sup>∈</sup> *<sup>M</sup> is UMVU for μ* = *κ (θ ), and it fulfills the balance property on portfolio level, i.e.,*

$$\sum\_{l=1}^{n} \mathbb{E}\_{\widehat{\mu}^{\text{MLE}}} \left[ \upsilon\_l Y\_l \right] = \sum\_{l=1}^{n} \upsilon\_l \widehat{\mu}^{\text{MLE}} = \sum\_{l=1}^{n} \upsilon\_l Y\_l.$$

The score in this EDF is given by

$$\log(\theta, Y) = \frac{\partial}{\partial \theta} \log p(Y; \theta) = \frac{\partial}{\partial \theta} \sum\_{l=1}^{n} \frac{\upsilon\_l}{\varphi} \left(\theta Y\_l - \kappa(\theta)\right) = \sum\_{l=1}^{n} \frac{\upsilon\_l}{\varphi} \left(Y\_l - \kappa'(\theta)\right).$$

Of course, we have <sup>E</sup>*<sup>θ</sup>* [*s(θ , <sup>Y</sup>)*] = <sup>0</sup> and we receive Fisher's information for *<sup>θ</sup>* <sup>∈</sup> ˚

$$\mathcal{L}(\theta) = -\mathbb{E}\_{\theta} \left[ \frac{\partial^2}{\partial \theta^2} \log p(\mathbf{Y}; \theta) \right] = \sum\_{l=1}^{n} \frac{v\_l}{\varphi} \kappa''(\theta) > 0. \tag{3.25}$$


Corollary 2.15 gives for the variance of the MLE

$$\text{Var}\_{\theta} \left( \widehat{\mu}^{\text{MLE}} \right) = \frac{\varphi}{\sum\_{l=1}^{n} v\_{l}} \kappa''(\theta) = \frac{(\kappa''(\theta))^{2}}{\mathcal{Z}(\theta)} = \frac{(\partial \mu(\theta)/\partial \theta)^{2}}{\mathcal{Z}(\theta)}.$$

This verifies that $\widehat{\mu}^{\text{MLE}}$ meets the Cramér–Rao information bound and is UMVU for the mean $\mu = \kappa'(\theta)$.

*Example 3.22 (Poisson Case)* For this example, we consider independent Poisson random variables $N\_i \sim \text{Poi}(v\_i \lambda)$. In Sect. 2.2.2 we have seen that $Y\_i = N\_i / v\_i$ can be modeled within the single-parameter linear EDF framework using as cumulant function the exponential function $\kappa(\theta) = e^{\theta}$, and setting $\omega\_i = v\_i$ and $\varphi = 1$. Thus, the probability weights of a single observation $Y\_i$ are given by, see (2.15),

$$f(y\_i; \theta, v\_i) = \exp\left\{v\_i \left(\theta y\_i - e^{\theta}\right) + a(y\_i; v\_i)\right\},$$

with canonical parameter $\theta = \log(\lambda) \in \Theta = \mathbb{R}$. The MLE in the mean parametrization is given by, see (3.24),

$$\widehat{\lambda}^{\text{MLE}} = \frac{\sum\_{i=1}^{n} v\_{i} Y\_{i}}{\sum\_{i=1}^{n} v\_{i}} = \frac{\sum\_{i=1}^{n} N\_{i}}{\sum\_{i=1}^{n} v\_{i}} \in \overline{\mathcal{M}} = [0, \infty).$$

This estimator is unbiased for $\lambda$. Since the Poisson random variables are independent, we can calculate the variance of this estimator as

$$\text{Var}\left(\widehat{\lambda}^{\text{MLE}}\right) = \frac{\lambda}{\sum\_{i=1}^{n} v\_{i}}.$$

Moreover, from Corollary 3.21 we know that this estimator is UMVU for $\lambda$; this can also be verified directly using Fisher's information (3.25), which, with dispersion parameter $\varphi = 1$, reads

$$\mathcal{I}(\theta) = -\mathbb{E}\_{\theta} \left[ \frac{\partial^2}{\partial \theta^2} \log p(\boldsymbol{Y}; \theta) \right] = \sum\_{i=1}^{n} v\_i \kappa''(\theta) = \lambda \sum\_{i=1}^{n} v\_i.$$
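The formulas of this example are easy to check by simulation. The following sketch is not part of the text; the portfolio volumes and the frequency $\lambda = 0.1$ are illustrative choices, and the Poisson sampler is a standard Knuth construction:

```python
import math
import random
import statistics

random.seed(2023)

# Illustrative (hypothetical) portfolio: volumes v_i and a true frequency lam.
lam = 0.1
v = [random.uniform(0.5, 2.0) for _ in range(400)]

def rpois(mu):
    """Poisson sampler via Knuth's method; adequate for small means."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def fit(volumes):
    """MLE (3.24): volume-weighted average of Y_i = N_i / v_i."""
    N = [rpois(vi * lam) for vi in volumes]
    return sum(N), sum(N) / sum(volumes)

N_tot, lam_hat = fit(v)

# Balance property (Corollary 3.21): estimated and observed totals agree.
assert abs(sum(vi * lam_hat for vi in v) - N_tot) < 1e-8

# The variance of the MLE attains the Cramer-Rao bound lam / sum(v_i).
est = [fit(v)[1] for _ in range(1000)]
print(statistics.variance(est), lam / sum(v))  # the two values should be close
```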

One could study many other properties of decision rules (and corresponding estimators), for instance, admissibility or uniformly minimum risk equivariance (UMRE), and we could also study other families of distribution functions such as group families. We refrain from doing so because we will not need this for our purposes.

## **3.4 Asymptotic Behavior of Estimators**

All results above have been based on a finite sample $\boldsymbol{Y}\_n = (Y\_1, \dots, Y\_n)^\top$; we add a lower index $n$ to $\boldsymbol{Y}\_n$ to indicate the finite sample size $n \in \mathbb{N}$. The aim of this section is to analyze properties of decision rules when the sample size $n$ tends to infinity.

## *3.4.1 Consistency*

Assume we have an infinite sequence of observations $Y\_i$, $i \ge 1$, which allows us to construct an infinite sequence of decision rules $A\_n = A\_n(\boldsymbol{Y}\_n)$, $n \ge 1$, where $A\_n$ always considers the first $n$ observations $\boldsymbol{Y}\_n = (Y\_1, \dots, Y\_n)^\top \sim P\_n(\cdot; \theta)$, for $\theta \in \Theta$ not depending on $n$. To fix ideas, one may think of i.i.d. random variables $Y\_i$.

**Definition 3.23 (Consistency)** The sequence $A\_n = A\_n(\boldsymbol{Y}\_n) \in \mathbb{R}^r$, $n \ge 1$, is consistent for $\gamma: \Theta \to \mathbb{R}^r$ if for all $\theta \in \Theta$ and for all $\varepsilon > 0$ we have

$$\lim\_{n \to \infty} \mathbb{P}\_{\theta} \left[ \|A\_n(\boldsymbol{Y}\_n) - \gamma(\theta)\|\_{2} > \varepsilon \right] = 0.$$

Definition 3.23 says that $A\_n(\boldsymbol{Y}\_n)$ converges in probability to $\gamma(\theta)$ as $n \to \infty$. If we (even) have a.s. convergence, we call $A\_n$, $n \ge 1$, *strongly consistent* for $\gamma: \Theta \to \mathbb{R}^r$. Consistency is a minimal property that decision rules should fulfill. Typically, in applications, this is not enough, and we are interested in (fast) rates of convergence, i.e., we would like to know the error rates between $A\_n(\boldsymbol{Y}\_n)$ and $\gamma(\theta)$ as $n \to \infty$.

*Example 3.24 (Consistency of the MLE in the EF)* We revisit Corollary 3.19 and consider an i.i.d. sequence of random variables $Y\_i$, $i \ge 1$, belonging to an EF; we assume to work under a minimal representation and with a steep cumulant function $\kappa$. The MLE for $\mu$ is given by the statistic

$$\widehat{\mu}\_n^{\text{MLE}} = \frac{1}{n} S(Y\_n) = \frac{1}{n} \sum\_{i=1}^n (T\_1(Y\_i), \dots, T\_k(Y\_i))^\top \in \overline{\mathcal{M}}.$$

We add a lower index $n$ to the MLE to indicate the sample size. The i.i.d. property of $Y\_i$, $i \ge 1$, implies that we can apply the strong law of large numbers, which tells us that $\lim\_{n \to \infty} \widehat{\mu}\_n^{\text{MLE}} = \mathbb{E}\_{\theta}[T(Y\_1)] = \nabla\_{\theta}\kappa(\theta) = \mu$, a.s., for all $\theta \in \Theta$. This implies strong consistency of the sequence of MLEs $\widehat{\mu}\_n^{\text{MLE}}$, $n \ge 1$, for $\mu$.

We have seen that these MLEs are also UMVU for $\mu$, but if we transform them to the canonical scale $\widehat{\theta}\_n^{\text{MLE}}$ they are, in general, biased for $\theta$, see (3.22). However, since the cumulant function $\kappa$ is strictly convex (under a minimal representation) we receive $\lim\_{n \to \infty} \widehat{\theta}\_n^{\text{MLE}} = \theta$, a.s., which provides strong consistency also on the canonical scale.

**Proposition 3.25** *Assume the real-valued random variables $Y\_i$, $i \ge 1$, are i.i.d. $F(\cdot; \theta)$ distributed with fixed $\theta \in \Theta$. The resulting empirical distributions $\widehat{F}\_n$, $n \ge 1$, are given by* (3.9)*. Assume $Q$ is a Fisher-consistent functional for $\gamma(\theta)$, i.e., $Q(F(\cdot; \theta)) = \gamma(\theta)$ for all $\theta \in \Theta$. Moreover, assume that $Q$ is continuous in $F(\cdot; \theta)$, for all $\theta \in \Theta$, w.r.t. the supremum norm. Then the functionals $Q(\widehat{F}\_n)$, $n \ge 1$, are consistent for $\gamma(\theta)$.*

*Sketch of Proof* The Glivenko–Cantelli theorem [64, 159] says that the empirical distribution $\widehat{F}\_n$ converges uniformly to $F(\cdot; \theta)$, a.s., as $n \to \infty$. Using the assumptions made, we are allowed to exchange the corresponding limits, which provides consistency.
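Consistency can also be illustrated numerically. The following sketch is not from the text; it uses an exponential sample with an illustrative mean, for which the MLE of the mean is the sample mean (cf. Example 3.24), and shows the estimation error shrinking in $n$:

```python
import random
import statistics

random.seed(7)
mu = 2.5  # illustrative true mean of an exponential distribution
ys = [random.expovariate(1 / mu) for _ in range(100_000)]

# The MLE of mu from the first n observations is the sample mean.
errors = {n: abs(statistics.fmean(ys[:n]) - mu) for n in (100, 1_000, 10_000, 100_000)}
for n, err in errors.items():
    print(n, err)  # the estimation error shrinks as n grows
```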

In view of Proposition 3.25, we discuss the case of the MLE of $\theta \in \Theta$. In Example 3.10 we have seen that the MLE of $\theta \in \Theta$ is obtained from a Fisher-consistent functional $Q$ for $\theta$ on the set of probability distributions $\mathcal{P}$ given by

$$Q(F) = \operatorname\*{arg\,max}\_{\widetilde{\theta}} \int \log f(\mathbf{y}; \widetilde{\theta}) dF(\mathbf{y}) = \operatorname\*{arg\,min}\_{\widetilde{\theta}} D\_{\text{KL}}(f || f(\cdot; \widetilde{\theta})),$$

where in the second step we assumed that $F$ has a density $f$ w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$.

Assume we have i.i.d. data $Y\_i \sim f(\cdot; \theta)$, $i \ge 1$. Thus, the true data generating distribution is described by the parameter $\theta \in \Theta$. MLE requires the study of the log-likelihood function (we scale with the sample size $n$)

$$
\widetilde{\theta} \mapsto \frac{1}{n} \ell\_{Y\_n}(\widetilde{\theta}) = \frac{1}{n} \sum\_{i=1}^n \log f(Y\_i; \widetilde{\theta}).
$$

The law of large numbers gives us, a.s.,

$$\lim\_{n \to \infty} \frac{1}{n} \sum\_{i=1}^{n} \log f(Y\_i; \widetilde{\theta}) = \mathbb{E}\_{\theta} \left[ \log f(Y; \widetilde{\theta}) \right]. \tag{3.26}$$

Thus, *if* we are allowed to exchange the arg max operation and the limit in *n* → ∞ we receive, a.s.,

$$\lim\_{n \to \infty} \widehat{\theta}\_n^{\text{MLE}} = \lim\_{n \to \infty} \left( \arg \max\_{\widetilde{\theta}} \frac{1}{n} \sum\_{i=1}^n \log f(Y\_i; \widetilde{\theta}) \right)$$

$$\overset{?}{=} \arg \max\_{\widetilde{\theta}} \left( \lim\_{n \to \infty} \frac{1}{n} \sum\_{i=1}^n \log f(Y\_i; \widetilde{\theta}) \right)$$

$$= \arg \max\_{\widetilde{\theta}} \mathbb{E}\_{\theta} \left[ \log f(Y; \widetilde{\theta}) \right] \\= \mathcal{Q}(F(\cdot; \theta)) \ = \theta. \qquad (3.27)$$

That is, we receive consistency of the MLE for $\theta$ if we are allowed to exchange the arg max operation and the limit in $n \to \infty$. This requires regularity conditions on the considered family of distributions $\mathcal{F} = \{F(\cdot; \theta); \theta \in \Theta\}$. The case of a finite parameter space $\Theta = \{\theta\_1, \dots, \theta\_J\}$ is easy; the following is a simplified version of Wald's [374] consistency proof:

$$\mathbb{P}\_{\theta\_j} \left[ \theta\_j \notin \operatorname\*{arg\,max}\_{\theta\_k} \frac{1}{n} \sum\_{i=1}^n \log f(Y\_i; \theta\_k) \right] \le \sum\_{k \ne j} \mathbb{P}\_{\theta\_j} \left[ \frac{1}{n} \sum\_{i=1}^n \log f(Y\_i; \theta\_k) > \frac{1}{n} \sum\_{i=1}^n \log f(Y\_i; \theta\_j) \right].$$

The right-hand side converges to 0 as $n \to \infty$ for all $\theta\_k \neq \theta\_j$, which gives consistency. For regularity conditions on more general parameter spaces we refer to Section 5.2 in Van der Vaart [363]. Basically, one needs that the arg max of the limiting function given on the right-hand side of (3.26) is well-separated from other large values of that function, see Theorem 5.7 in Van der Vaart [363].
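Wald's finite parameter space argument can be checked by simulation. In this sketch (not from the text; the Poisson model and the grid are illustrative choices) we estimate the probability that the MLE over $\Theta = \{0.5, 1.0, 1.5\}$ misses the true parameter:

```python
import math
import random

random.seed(11)
grid = [0.5, 1.0, 1.5]  # finite parameter space (illustrative)
true_lam = 1.0

def rpois(mu):
    """Poisson sampler via Knuth's method."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def loglik(sample, lam):
    # Poisson log-likelihood, dropping the lam-independent log(y!) terms.
    return sum(y * math.log(lam) - lam for y in sample)

def miss_rate(n, reps=500):
    """Fraction of samples where the MLE over the grid is not true_lam."""
    miss = 0
    for _ in range(reps):
        sample = [rpois(true_lam) for _ in range(n)]
        miss += max(grid, key=lambda lam: loglik(sample, lam)) != true_lam
    return miss / reps

for n in (5, 50, 500):
    print(n, miss_rate(n))  # the miss probability decays towards 0
```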

#### *Remarks 3.26*

• The estimator from the arg max operation in (3.27) is also called M-estimator, and $(y, a) \mapsto \log f(y; a)$ plays the role of a scoring function (similar to a loss function). The last line of (3.27) says that this scoring function is strictly consistent for the functional $Q: \mathcal{F} \to \Theta$, and Fisher-consistency of this functional $Q$ implies

$$\mathbb{E}\_{\theta} \left[ \log f(Y; \widetilde{\theta}) \right] \le \mathbb{E}\_{\theta} \left[ \log f(Y; \mathcal{Q}(F(\cdot; \theta))) \right] = \mathbb{E}\_{\theta} \left[ \log f(Y; \theta) \right],$$

for all $\widetilde{\theta} \in \Theta$. Strict consistency of loss and scoring functions is going to be defined formally in Sect. 4.1.3, below, and we have just seen that this plays an important role for the consistency of M-estimators in the sense of Definition 3.23.

• Consistency (3.27) assumes that the data generating model $Y \sim F$ belongs to the specified family $\mathcal{F} = \{F(\cdot; \theta); \theta \in \Theta\}$. Model uncertainty may imply that the data generating model does not belong to $\mathcal{F}$. In this situation, and if we are allowed to exchange the arg max operation and the limit in $n$ in (3.27), the MLE will provide the model in $\mathcal{F}$ that is closest in KL divergence to the true model $F$. We come back to this in Sect. 11.1.4, below.
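The last point can be illustrated numerically: if we fit a Poisson model to data that is not Poisson, the MLE converges to the KL-closest Poisson member, which has the same mean as the data generating model. A minimal sketch (not from the text; the geometric distribution and its parameter are illustrative choices):

```python
import random
import statistics

random.seed(13)
p = 0.4  # geometric success probability (illustrative)

def rgeom(p):
    """Number of failures before the first success."""
    k = 0
    while random.random() >= p:
        k += 1
    return k

ys = [rgeom(p) for _ in range(100_000)]

# The Poisson MLE is the sample mean; the KL-closest Poisson model has
# lam* = E[Y] = (1 - p) / p, even though the data is not Poisson.
lam_hat = statistics.fmean(ys)
print(lam_hat, (1 - p) / p)  # the two values should be close
```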

## *3.4.2 Asymptotic Normality*

As mentioned above, typically, we would like to have stronger results than just consistency. We give an introductory example based on the EF.

*Example 3.27 (Asymptotic Normality of the MLE in the EF)* We work under the same EF as in Example 3.24. This example has provided consistency of the sequence of MLEs $\widehat{\mu}\_n^{\text{MLE}}$, $n \ge 1$, for $\mu$. Note that the i.i.d. property together with the finite variance property immediately implies the following convergence in distribution

$$\sqrt{n}\left(\widehat{\mu}\_n^{\mathrm{MLE}} - \mu\right) \Rightarrow \mathcal{N}(0, \nabla\_\theta^2 \kappa(\theta)) \stackrel{(\mathrm{d})}{=} \mathcal{N}\left(0, \mathcal{I}\_1(\theta)\right) \qquad \text{ as } n \to \infty,$$

where $\theta = \theta(\mu) = (\nabla\_\theta \kappa)^{-1}(\mu) \in \Theta$ for $\mu \in \mathcal{M}$, and $\mathcal{N}$ denotes the Gaussian distribution. This is the multivariate version of the central limit theorem (CLT), and it tells us that the rate of convergence is $1/\sqrt{n}$. This asymptotic result is stated in terms of Fisher's information matrix under the parametrization in $\theta$. We transform this to the dual mean parametrization and call Fisher's information matrix under the dual mean parametrization $\mathcal{I}\_1^\*(\mu)$. This involves the change of variable $\mu \mapsto \theta = \theta(\mu) = (\nabla\_\theta \kappa)^{-1}(\mu)$. The Jacobian matrix of this change of variable is given by $J(\mu) = \mathcal{I}\_1(\theta(\mu))^{-1}$ and, thus, the transformation of Fisher's information matrix gives, see also (3.16),

$$\mu \mapsto \mathcal{I}\_1^\*(\mu) = J(\mu)^\top \mathcal{I}\_1(\theta(\mu)) \, J(\mu) = \mathcal{I}\_1(\theta(\mu))^{-1}.$$

This allows us to express the above CLT w.r.t. Fisher's information matrix corresponding to *μ* and it gives us

$$\sqrt{n}\left(\widehat{\mu}\_n^{\mathrm{MLE}} - \mu\right) \Rightarrow \mathcal{N}\left(0, \mathcal{I}\_1^\*(\mu)^{-1}\right) \qquad \text{as } n \to \infty. \tag{3.28}$$

We conclude that the appropriately normalized MLE $\widehat{\mu}\_n^{\text{MLE}}$ converges in distribution to the centered Gaussian distribution having as covariance matrix the *inverse of Fisher's information matrix* $\mathcal{I}\_1^\*(\mu)$, and the *rate of convergence* is $1/\sqrt{n}$.

Assume that the effective domain $\Theta$ is open, and that $\theta = \theta(\mu) \in \Theta$. This allows us to transform the asymptotic normality (3.28) to the canonical scale. Consider again the change of variable $\mu \mapsto \theta = \theta(\mu) = (\nabla\_\theta \kappa)^{-1}(\mu)$ with Jacobian matrix $J(\mu) = \mathcal{I}\_1(\theta(\mu))^{-1} = \mathcal{I}\_1^\*(\mu)$. Theorem 1.9 in Section 5.2 of Lehmann [244] tells us how the CLT transforms under such a change of variable, namely,

$$\sqrt{n}\left(\widehat{\theta}\_{n}^{\text{MLE}} - \theta\right) = \sqrt{n}\left((\nabla\_{\theta}\kappa)^{-1}\left(\widehat{\mu}\_{n}^{\text{MLE}}\right) - (\nabla\_{\theta}\kappa)^{-1}(\mu)\right) \tag{3.29}$$

$$\Rightarrow \mathcal{N}\left(0, J(\mu) \, \mathcal{I}\_{1}^{\*}(\mu)^{-1} J(\mu)^\top\right) \stackrel{\text{(d)}}{=} \mathcal{N}\left(0, \mathcal{I}\_{1}(\theta)^{-1}\right) \qquad \text{ as } n \to \infty.$$

We have exactly the same structural form in the two asymptotic results (3.28) and (3.29). The main difference is that $\widehat{\mu}\_n^{\text{MLE}}$ is unbiased for $\mu$ whereas, in general, $\widehat{\theta}\_n^{\text{MLE}}$ is not unbiased for $\theta$; nevertheless, we receive the same asymptotic behavior.
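The two asymptotic results (3.28) and (3.29) can be checked by simulation in the Poisson case with unit weights, where the asymptotic variances are $\lambda$ on the mean scale and $1/\lambda$ on the canonical scale $\theta = \log \lambda$. This sketch is not from the text and uses illustrative values:

```python
import math
import random
import statistics

random.seed(3)
lam, n, reps = 2.0, 400, 2000

def rpois(mu):
    """Poisson sampler via Knuth's method."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

z_mean, z_can = [], []
for _ in range(reps):
    lam_hat = sum(rpois(lam) for _ in range(n)) / n  # MLE on the mean scale
    z_mean.append(math.sqrt(n) * (lam_hat - lam))
    # canonical scale theta-hat = log(lambda-hat), cf. (3.29)
    z_can.append(math.sqrt(n) * (math.log(lam_hat) - math.log(lam)))

# Empirical variances versus the asymptotic ones lam and 1/lam:
print(statistics.variance(z_mean), lam)
print(statistics.variance(z_can), 1 / lam)
```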

There are many different versions of asymptotic normality results similar to (3.28) and (3.29), and the main difficulty often is to verify the assumptions made. For instance, one can prove asymptotic normality based on a Fisher-consistent functional $Q$. The assumptions made are, among others, that $Q$ needs to be Fréchet differentiable in $P(\cdot; \theta)$, which, unfortunately, is rather difficult to verify. We make a list of assumptions here that are easier to check, and then we give a version of the asymptotic normality result as stated in the book of Lehmann [244]. This list of assumptions in the one-dimensional case $\Theta \subseteq \mathbb{R}$ comprises conditions (i)–(vi); the crucial smoothness condition (vi) requires the existence of a function $M(y)$ with $\mathbb{E}\_{\theta\_0}[M(Y)] < \infty$ such that


$$\left|\frac{\partial^3}{\partial \theta^3} \log f(y; \theta)\right| \le M(y) \qquad \text{for all } y \in \mathcal{Y} \text{ and } \theta \in (\theta\_0 - c, \theta\_0 + c).$$

**Theorem 3.28 (Theorem 2.3 in Section 6.2 of Lehmann [244])** *Assume $Y\_i$, $i \ge 1$, are i.i.d. $F(\cdot; \theta)$ distributed satisfying* (i)*–*(vi) *from above. Assume that $\widehat{\theta}\_n = \widehat{\theta}\_n(\boldsymbol{Y}\_n)$, $n \ge 1$, is a sequence of roots that solves the score equations*

$$\frac{\partial}{\partial \widetilde{\theta}} \sum\_{l=1}^n \log f(Y\_l; \widetilde{\theta}) = \frac{\partial}{\partial \widetilde{\theta}} \ell\_{Y\_n}(\widetilde{\theta}) = 0,$$

*and which is consistent for $\theta$, i.e., this sequence of roots $\widehat{\theta}\_n(\boldsymbol{Y}\_n)$ converges in probability to the true parameter $\theta$. Then we have asymptotic normality*

$$\sqrt{n}\left(\widehat{\theta}\_{n}-\theta\right) \Rightarrow \mathcal{N}\left(0, \mathcal{I}\_{1}(\theta)^{-1}\right) \qquad \text{as } n \to \infty. \tag{3.30}$$

*Sketch of Proof* Fix $\theta \in \Theta$ and consider a Taylor expansion of the score $\ell'\_{\boldsymbol{Y}\_n}(\cdot)$ around $\theta$, evaluated in $\widehat{\theta}\_n$. It is given by

$$\ell\_{\boldsymbol{Y}\_n}'(\widehat{\theta}\_n) = \ell\_{\boldsymbol{Y}\_n}'(\theta) + \ell\_{\boldsymbol{Y}\_n}''(\theta) \left(\widehat{\theta}\_n - \theta\right) + \frac{1}{2} \ell\_{\boldsymbol{Y}\_n}'''(\widetilde{\theta}\_n) \left(\widehat{\theta}\_n - \theta\right)^2,$$

for some $\widetilde{\theta}\_n \in [\theta, \widehat{\theta}\_n]$. Since $\widehat{\theta}\_n$ is a root of the score, the left-hand side is equal to zero. This allows us to rearrange the above Taylor expansion as follows

$$\sqrt{n}\left(\widehat{\theta}\_{n}-\theta\right) = \frac{\frac{1}{\sqrt{n}}\ell\_{\boldsymbol{Y}\_{n}}'(\theta)}{-\frac{1}{n}\ell\_{\boldsymbol{Y}\_{n}}''(\theta) - \frac{1}{2n}\ell\_{\boldsymbol{Y}\_{n}}'''(\widetilde{\theta}\_{n})\left(\widehat{\theta}\_{n}-\theta\right)}.$$

The numerator on the right-hand side converges in distribution to $\mathcal{N}(0, \mathcal{I}\_1(\theta))$, see (18) in Section 6.2 of [244]; the first term in the denominator converges in probability to $\mathcal{I}\_1(\theta)$, see (19) in Section 6.2 of [244]; and in the second term of the denominator, $\frac{1}{2n} \ell\_{\boldsymbol{Y}\_n}'''(\widetilde{\theta}\_n)$ is bounded in probability, see (20) in Section 6.2 of [244]. The claim then follows from Slutsky's theorem.

#### *Remarks 3.29*


• Asymptotic normality (3.30) carries over to a continuously differentiable transformation $\gamma(\widehat{\theta}\_n)$ of the estimator (delta method):

$$\sqrt{n}\left(\gamma\left(\widehat{\theta}\_{n}\right)-\gamma\left(\theta\right)\right) \;\Rightarrow\; \mathcal{N}\left(0,\frac{\left(\gamma'\left(\theta\right)\right)^{2}}{\mathcal{I}\_{1}(\theta)}\right)\qquad\text{ as }n\to\infty.\tag{3.31}$$

This follows from asymptotic normality, consistency and considering a Taylor expansion around *θ*.

• We started from the MLE problem

$$\widehat{\theta}\_n^{\mathsf{MLE}} = \arg\max\_{\widetilde{\theta}} \frac{1}{n} \sum\_{i=1}^n \log f(Y\_i; \widetilde{\theta}). \tag{3.32}$$


In statistical theory, a parameter estimator that is obtained through a maximization operation is called M-estimator (for maximizing or minimizing), see also Remarks 3.26. If the log-likelihood is differentiable in $\widetilde{\theta}$, we can turn the above problem into a root search problem for $\widetilde{\theta}$

$$\frac{1}{n}\sum\_{l=1}^{n}\frac{\partial}{\partial\tilde{\theta}}\log f(Y\_l;\tilde{\theta})=0.\tag{3.33}$$

If a parameter estimator is obtained through a root search problem, it is called Z-estimator (for equating to zero). The Z-estimator (3.33) does not require a maximum of the original function, but only a critical point; this is exactly what we have been exploring in Theorem 3.28. More generally, for a sufficiently nice function $\psi(\cdot; \theta)$, a Z-estimator $\widehat{\theta}\_n^{\text{Z}}$ for $\theta$ is obtained by solving the following equation for $\widetilde{\theta}$

$$\frac{1}{n}\sum\_{i=1}^{n}\psi(Y\_i;\widetilde{\theta})=0,\tag{3.34}$$

for i.i.d. data $Y\_i \sim F(\cdot; \theta)$. Suppose that the first moment of $\psi(Y\_i; \widetilde{\theta})$ exists. The law of large numbers gives us, a.s., see also (3.26),

$$\lim\_{n \to \infty} \frac{1}{n} \sum\_{i=1}^{n} \psi(Y\_i; \tilde{\theta}) = \mathbb{E}\_{\theta} \left[ \psi(Y; \tilde{\theta}) \right]. \tag{3.35}$$

Consistency of the Z-estimator $\widehat{\theta}\_n^{\text{Z}}$, $n \ge 1$, for $\theta$ is related to the right-hand side of (3.35) being zero for $\widetilde{\theta} = \theta$. Under additional regularity conditions (and given consistency), asymptotic normality then holds

$$\sqrt{n}\left(\widehat{\theta}\_{n}^{\mathbb{Z}}-\theta\right)\Rightarrow\mathcal{N}\left(0,\frac{\mathbb{E}\_{\theta}\left[\psi(Y;\theta)^{2}\right]}{\mathbb{E}\_{\theta}\left[\frac{\partial}{\partial\theta}\psi(Y;\theta)\right]^{2}}\right)\qquad\text{as }n\to\infty.\tag{3.36}$$

For rigorous statements we refer to Theorems 5.21 and 5.41 in Van der Vaart [363]. A modification to the regression case is given in Theorem 11.6 below.
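Formula (3.36) can be illustrated with the sample median as a Z-estimator, i.e., $\psi(y; \theta) = \mathbf{1}\{y \le \theta\} - 1/2$; for such a nonsmooth $\psi$, the denominator is read as the derivative of $\widetilde{\theta} \mapsto \mathbb{E}\_\theta[\psi(Y; \widetilde{\theta})]$, which for standard Gaussian data equals the standard normal density at 0, so the sandwich formula gives the asymptotic variance $(1/4) \cdot 2\pi = \pi/2$. A simulation sketch (not from the text; the sample sizes are illustrative):

```python
import math
import random
import statistics

random.seed(9)
n, reps = 801, 2000

meds = []
for _ in range(reps):
    ys = sorted(random.gauss(0.0, 1.0) for _ in range(n))
    meds.append(math.sqrt(n) * ys[n // 2])  # sqrt(n) * (sample median - 0)

# Sandwich variance: E[psi^2] / (dE[psi]/dtheta)^2 = (1/4) * 2*pi = pi/2
print(statistics.variance(meds), math.pi / 2)  # the two values should be close
```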

*Example 3.30* We consider the single-parameter linear EF for a given strictly convex and steep cumulant function $\kappa$, w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$. The score equation gives the requirement

$$\frac{1}{n}S(Y\_n) \stackrel{!}{=} \kappa'(\theta) = \mathbb{E}\_{\theta}[Y\_1].\tag{3.37}$$

Strict convexity implies that the right-hand side is strictly increasing in $\theta$. Therefore, the score equation has at most one solution here. We assume that the effective domain $\Theta \subseteq \mathbb{R}$ is open. It is easily verified that assumptions (ii)–(vi) hold; in particular, (vi), saying that the third derivative should have a uniformly bounded integrable bound, holds because the third derivative is independent of $y$ and continuous in $\theta$. With probability converging to 1, (3.37) has a solution $\widehat{\theta}\_n$ which is unique and consistent, and Theorem 3.28 holds. Note that in Example 3.5 we have mentioned the Poisson case which can be degenerate. For the asymptotic normality result we use here that this degeneracy asymptotically vanishes with probability converging to one.
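In practice, the score equation (3.37) is solved numerically. The following Newton sketch (not from the text) uses the Poisson cumulant function $\kappa(\theta) = e^\theta$, where the root is also available in closed form as the logarithm of the sample mean, which serves as a check:

```python
import math

def solve_score(ybar, theta0=0.0, tol=1e-12, max_iter=100):
    """Newton root search for kappa'(theta) = ybar with kappa(theta) = exp(theta).
    Strict convexity (kappa'' > 0) gives at most one root; it exists for ybar > 0."""
    theta = theta0
    for _ in range(max_iter):
        step = (math.exp(theta) - ybar) / math.exp(theta)  # score deficit / kappa''
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Closed-form check: the unique root is theta = log(ybar).
print(solve_score(2.5), math.log(2.5))  # the two values should agree
```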

*Remark 3.31 (Multi-Dimensional Extension)* For an extension of Theorem 3.28 to the multi-dimensional case $\Theta \subseteq \mathbb{R}^k$ we refer to Section 6.4 in Lehmann [244]. The assumptions made in the multi-dimensional case do not essentially differ from the ones in the one-dimensional case.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 4 Predictive Modeling and Forecast Evaluation**

In the previous chapter, we have fully focused on parameter estimation $\theta \in \Theta$ and the estimation of functions $\theta \mapsto \gamma(\theta)$ by exploiting decision rules $A$ for estimating $\boldsymbol{Y}\_n \mapsto \widehat{\theta} = A(\boldsymbol{Y}\_n)$ or $\boldsymbol{Y}\_n \mapsto \widehat{\gamma(\theta)} = A(\boldsymbol{Y}\_n)$, respectively. The derivations in that chapter analyzed the quality of decision rules in terms of loss functions which compare, e.g., the action $\widehat{\theta} = A(\boldsymbol{Y}\_n)$ to the true parameter $\theta$. The Cramér–Rao information bound considers this in terms of a square loss function. In actuarial modeling, parameter estimation is only part of the problem, and the second part is to predict new random variables $Y$. These new random variables should be thought of as claims in the future that we try to predict (and price) using decision rules developed on past information $\boldsymbol{Y}\_n = (Y\_1, \dots, Y\_n)^\top$. In this case, we would like to study how a decision rule $A(\boldsymbol{Y}\_n)$ *generalizes* to new data $Y$, and we then rather call the decision rule a *predictor* for $Y$. This capability of suitable decision rules to generalize to new (unseen) data is analyzed in Sect. 4.1. Such an analysis often relies on (numerical) techniques such as cross-validation, which is examined in Sect. 4.2, or the bootstrap technique, presented in Sect. 4.3, below. In this chapter, we denote the past observations by $\boldsymbol{Y}\_n = (Y\_1, \dots, Y\_n)^\top$ supported on $\boldsymbol{\mathcal{Y}}$, and the (real-valued) random variables to be predicted are denoted by $Y$ with support $\mathcal{Y} \subset \mathbb{R}$. Often we have $\boldsymbol{\mathcal{Y}} = \mathcal{Y} \times \cdots \times \mathcal{Y}$.

## **4.1 Generalization Loss**

We start by considering the most commonly used *expected generalization loss* (GL), which is the *mean squared error of prediction* (MSEP). The MSEP is based on the square loss function, and it can be seen as a distribution-free approach to measure expected GL. In subsequent sections we will study distribution-adapted GL approaches. Expected GL measurement with MSEP is considered to be general knowledge and we do not give a specific reference in this section. Distribution-adapted versions are mainly based on the strictly consistent scoring framework of Gneiting–Raftery [163] and Gneiting [162]. In particular, we will discuss *deviance losses* in Sect. 4.1.2 that are strictly consistent scoring functions for mean estimation and, hence, provide proper scoring rules.

## *4.1.1 Mean Squared Error of Prediction*

We denote by $\boldsymbol{Y}\_n = (Y\_1, \dots, Y\_n)^\top$ the (past) observations on which predictors and decision rules $A: \boldsymbol{\mathcal{Y}} \to \mathbb{A}$ are based. The new observation that we would like to predict is denoted by $Y$, having support $\mathcal{Y} \subset \mathbb{R}$. In the previous chapter we have used the decision rule $A(\boldsymbol{Y}\_n)$ to estimate an unknown quantity $\gamma(\theta)$. In this section we will use this decision rule to directly predict the new (unseen) observation $Y$.

**Theorem 4.1 (Mean Squared Error of Prediction, MSEP)** *Assume that $\boldsymbol{Y}\_n$ and $Y$ are independent. Assume that the predictor $A: \boldsymbol{\mathcal{Y}} \to \mathbb{A} \subseteq \mathbb{R}$, $\boldsymbol{Y}\_n \mapsto A(\boldsymbol{Y}\_n)$, has finite second moment, and that the real-valued random variable $Y$ has finite second moment, too. The MSEP of the predictor $A$ to predict $Y$ is given by*

$$\mathbb{E}\left[\left(Y - A(Y\_n)\right)^2\right] = \left(\mathbb{E}\left[Y\right] - \mathbb{E}\left[A(Y\_n)\right]\right)^2 + Var(A(Y\_n)) + Var(Y). \tag{4.1}$$

*Proof of Theorem 4.1* We compute

$$\mathbb{E}\left[\left(A(Y\_n) - Y\right)^2\right] = \mathbb{E}\left[\left(A(Y\_n) - \mathbb{E}[Y] + \mathbb{E}[Y] - Y\right)^2\right]$$

$$= \mathbb{E}\left[\left(A(Y\_n) - \mathbb{E}[Y]\right)^2\right] + \mathbb{E}\left[\left(\mathbb{E}[Y] - Y\right)^2\right]$$

$$\qquad + 2\,\mathbb{E}\left[\left(A(Y\_n) - \mathbb{E}[Y]\right)\left(\mathbb{E}[Y] - Y\right)\right]$$

$$= \mathbb{E}\left[\left(\mathbb{E}\left[Y\right] - \mathbb{E}\left[A(Y\_n)\right] + \mathbb{E}\left[A(Y\_n)\right] - A(Y\_n)\right)^2\right] + \text{Var}(Y)$$

$$= \left(\mathbb{E}\left[Y\right] - \mathbb{E}\left[A(Y\_n)\right]\right)^2 + \text{Var}(A(Y\_n)) + \text{Var}(Y),$$

where on the second last line we use the independence between $\boldsymbol{Y}\_n$ and $Y$. This finishes the proof.
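The decomposition (4.1) can be verified by simulation. The sketch below is not from the text; it uses an i.i.d. Gaussian setting with the sample mean as predictor, so that the bias vanishes and the MSEP equals $\sigma^2/n + \sigma^2$:

```python
import random
import statistics

random.seed(4)
mu, sigma, n, reps = 1.0, 2.0, 25, 20_000  # illustrative values

sq_err, preds = [], []
for _ in range(reps):
    A = statistics.fmean(random.gauss(mu, sigma) for _ in range(n))  # predictor from Y_n
    Y = random.gauss(mu, sigma)                                      # new observation
    preds.append(A)
    sq_err.append((Y - A) ** 2)

lhs = statistics.fmean(sq_err)  # Monte Carlo MSEP
rhs = (mu - statistics.fmean(preds)) ** 2 + statistics.variance(preds) + sigma ** 2
print(lhs, rhs)  # both close to sigma^2/n + sigma^2
```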

#### *Remarks 4.2 (Expected Generalization Loss)*

• The quantity $\mathbb{E}[(Y - A(\boldsymbol{Y}\_n))^2]$ is an expected GL because it measures how well the decision rule (predictor) $A(\boldsymbol{Y}\_n)$ generalizes to new (unseen) data $Y$. As loss function we use the square loss function

$$L: \mathcal{Y} \times \mathbb{A} \to \mathbb{R}\_{+}, \qquad (y, a) \mapsto L(y, a) = (y - a)^{2}. \tag{4.2}$$

Therefore, this expected GL is called MSEP.

• MSEP (4.1) is called *expected* GL. If we condition on *Yn*, then we call it GL. For the square loss function the GL (conditional MSEP) is given by

$$\mathbb{E}\left[\left(Y - A(\boldsymbol{Y}\_n)\right)^2\Big|\boldsymbol{Y}\_n\right] = \left(\mathbb{E}\left[Y\right] - A(\boldsymbol{Y}\_n)\right)^2 + \text{Var}(Y),\tag{4.3}$$

where we have used independence between *Y* and *Yn*.

	- The first term $(\mathbb{E}[Y] - \mathbb{E}[A(\boldsymbol{Y}\_n)])^2$ is the (squared) *bias*. Obviously, good decision rules $A(\boldsymbol{Y}\_n)$ under the MSEP should be unbiased for $\mathbb{E}[Y]$. If we compare this to the previous chapter, we note that now the bias is measured w.r.t. the mean of the new observation $Y$. Additionally, there might be a slight difference to the previous chapter if $\boldsymbol{Y}\_n$ and $Y$ do not belong to the same parameter $\theta \in \Theta$ (if we work in a parametrized family): the risk function in (3.3) considers $R(\theta, A) = \mathbb{E}\_\theta[L(\theta, A(\boldsymbol{Y}\_n))]$ with both components of the loss function $L$ belonging to the same parameter value $\theta$. For the MSEP we replace $\theta$ in $L(\theta, A(\boldsymbol{Y}\_n))$ by the new observation $Y$ that might originate from a different distribution (or from a randomized $\theta$ in a Bayesian case).
	- The second term $\text{Var}(A(\boldsymbol{Y}\_n))$ is called *estimation variance* or *statistical error*.
	- The last term $\text{Var}(Y)$ is called *process variance* or *irreducible risk*. It reflects the pure randomness stemming from the fact that we try to predict random variables $Y$ with deterministic means $\mathbb{E}[Y]$.

• We emphasize that in financial applications we typically aim for unbiased estimators of $\mathbb{E}[Y]$; we especially refer to Sect. 7.4.2, which studies the balance property in network regression models under a stationary portfolio assumption. Here, this stationarity may, e.g., translate into a (stronger) i.i.d. assumption on $Y\_1, \dots, Y\_n, Y$. Unbiasedness then implies that the predictor $A(\boldsymbol{Y}\_n)$ is optimal in (4.1) if it meets the Cramér–Rao information bound, see Theorem 3.13.

Theorem 4.1 considers the MSEP, which implicitly assumes that the square loss function is the objective (scoring) function of interest. The square loss function may be considered as being distribution-free, but it is motivated by a Gaussian model for $\boldsymbol{Y}\_n$ and $Y$, respectively; this will be justified in Remarks 4.6, below. If we use the square loss function for observations different from Gaussian ones, it might under- or over-weigh particular characteristics of these observations because they may not look very Gaussian (e.g., being more heavy-tailed). Therefore, we should always choose a scoring function that fits the problem considered; for instance, a square loss function is not appropriate if we model claim counts following a Poisson distribution. We close this section with the example of the EDF.

*Example 4.3 (MSEP Within the EDF)* We choose a fixed single-parameter linear EDF satisfying Assumption 2.6 and having a steep cumulant function *κ*, see Theorem 2.19 and Remark 2.20. Assume we have independent random variables *Y*1*,...,Yn, Y* belonging to this EDF having densities, see Example 3.5,

$$Y\_i \sim f(y\_i; \theta, v\_i/\varphi) = \exp\left\{\frac{y\_i \theta - \kappa(\theta)}{\varphi/v\_i} + a(y\_i; v\_i/\varphi)\right\},\tag{4.4}$$

and similarly for $Y \sim f(y; \theta, v/\varphi)$. Note that all random variables share the same canonical parameter $\theta \in \mathring{\Theta}$. The MLE of $\mu \in \mathcal{M}$ based on $\boldsymbol{Y}\_n = (Y\_1, \dots, Y\_n)^\top$ is found by solving, see (3.4)–(3.5),

$$
\widehat{\mu}^{\text{MLE}} = \widehat{\mu}^{\text{MLE}}(Y\_n) = \underset{\widetilde{\mu} \in \overline{\mathcal{M}}}{\text{arg}\, \max} \, \ell\_{Y\_n}(\widetilde{\mu}) \tag{4.5}
$$

$$
= \underset{\widetilde{\mu} \in \overline{\mathcal{M}}}{\text{arg}\, \max} \sum\_{l=1}^n \frac{Y\_l h(\widetilde{\mu}) - \kappa \left(h(\widetilde{\mu})\right)}{\varphi/v\_l},
$$

with canonical link $h = (\kappa')^{-1}$. Since the cumulant function $\kappa$ is strictly convex and assumed to be steep, there exists a unique solution $\widehat{\mu}^{\text{MLE}} \in \overline{\mathcal{M}}$. If $\widehat{\mu}^{\text{MLE}} \in \mathcal{M}$ we have a proper solution providing $\widehat{\theta}^{\text{MLE}} = h(\widehat{\mu}^{\text{MLE}}) \in \Theta$, otherwise $\widehat{\mu}^{\text{MLE}}$ provides a degenerate model. This decision rule $\boldsymbol{Y}\_n \mapsto \widehat{\mu}^{\text{MLE}} = \widehat{\mu}^{\text{MLE}}(\boldsymbol{Y}\_n)$ is now used to predict the (independent) new random variable $Y$ and to estimate the unknown parameters $\theta$ and $\mu$, respectively. That is, we use the following predictor for $Y$

$$Y\_n \mapsto \widehat{Y} = \widehat{\mathbb{E}}\_{\theta}[Y] = \mathbb{E}\_{\widehat{\theta}^{\mathsf{MLE}}}[Y] = \widehat{\mu}^{\mathsf{MLE}} = \widehat{\mu}^{\mathsf{MLE}}(Y\_n).$$

Note that this predictor $\widehat{Y}$ is used to predict an unobserved (new) random variable $Y$, and it is itself a random variable as a function of the (independent) past observations $\boldsymbol{Y}\_n$. We calculate the MSEP in this model. Using Theorem 4.1 we obtain

$$\mathbb{E}\_{\theta}\left[\left(Y-\widehat{\mu}^{\text{MLE}}\right)^{2}\right] = \left(\mathbb{E}\_{\theta}\left[Y\right]-\mathbb{E}\_{\theta}\left[\widehat{\mu}^{\text{MLE}}\right]\right)^{2} + \text{Var}\_{\theta}\left(\widehat{\mu}^{\text{MLE}}\right) + \text{Var}\_{\theta}(Y)$$

$$= \left(\kappa'(\theta)-\kappa'(\theta)\right)^{2} + \frac{\varphi\kappa''(\theta)}{\sum\_{l=1}^{n}v\_{l}} + \frac{\varphi\kappa''(\theta)}{v} \tag{4.6}$$

$$= \frac{(\kappa''(\theta))^{2}}{\mathcal{Z}(\theta)} + \frac{\varphi\kappa''(\theta)}{v},$$

see (3.25) for Fisher's information $\mathcal{I}\_n(\theta)$. In this calculation we have used that the MLE $\widehat{\mu}^{\text{MLE}}$ is UMVU for $\mu = \kappa'(\theta)$ and that $\boldsymbol{Y}\_n$ and $Y$ come from the same EDF with the same canonical parameter $\theta \in \mathring{\Theta}$. As a result, we are only left with the estimation variance and the process variance; moreover, the estimation variance asymptotically vanishes as $\sum\_{i=1}^{n} v\_i \to \infty$. ∎
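The decomposition above can be checked by simulation. The following is a minimal sketch in the Poisson case with dispersion $\varphi = 1$, where $\kappa''(\theta) = \mu$; all numerical values (true mean, exposures, sample sizes) are our own illustrative choices and not taken from the text. The MLE under a common canonical parameter is the exposure-weighted mean, and the MSEP splits into the estimation variance $\mu/\sum\_i v\_i$ plus the process variance $\mu/v$:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, n = 0.1, 50                        # true mean (a low claim frequency) and sample size
v = rng.uniform(0.5, 1.5, size=n)      # exposures v_i of the past observations
v_new = 1.0                            # exposure of the new observation Y

n_sim = 100_000
N = rng.poisson(mu * v, size=(n_sim, n))      # v_i * Y_i ~ Poisson(v_i * mu)
mu_hat = N.sum(axis=1) / v.sum()              # MLE: exposure-weighted mean of the Y_i
Y_new = rng.poisson(mu * v_new, size=n_sim) / v_new

msep_empirical = np.mean((Y_new - mu_hat) ** 2)
msep_theory = mu / v.sum() + mu / v_new       # estimation variance + process variance
print(msep_empirical, msep_theory)
```

The empirical MSEP matches the theoretical split closely, and the estimation variance term is already negligible relative to the process variance for this portfolio size.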

## *4.1.2 Unit Deviances and Deviance Generalization Loss*

The main estimation technique used in these notes is MLE, introduced in Definition 3.4. At this stage, MLE is unrelated to any specific scoring function $L$ because it has been obtained by maximizing the log-likelihood function. In this section we discuss the deviance loss function (as a scoring function) and we highlight its connection to the Bregman divergence introduced in Sect. 2.3. Based on the deviance loss function choice we rephrase Theorem 4.1 in terms of this scoring function. A theoretical foundation for these considerations will be given in Sect. 4.1.3, below.

For the derivations in this section we rely on the same single-parameter linear EDF as in Example 4.3, having a steep cumulant function $\kappa$. The MLE of $\mu = \kappa'(\theta)$ is found by solving, see (4.5),

$$\widehat{\mu}^{\text{MLE}} = \widehat{\mu}^{\text{MLE}}(Y\_n) = \underset{\widetilde{\mu} \in \overline{\mathcal{M}}}{\text{arg}\max} \sum\_{i=1}^n \frac{Y\_i h(\widetilde{\mu}) - \kappa(h(\widetilde{\mu}))}{\varphi/v\_i} \in \overline{\mathcal{M}},$$

with canonical link $h = (\kappa')^{-1}$. This decision rule $\boldsymbol{Y}\_n \mapsto \widehat{\mu}^{\text{MLE}} = \widehat{\mu}^{\text{MLE}}(\boldsymbol{Y}\_n)$ is now used to predict the (new) random variable $Y$ and to estimate the unknown parameters $\theta$ and $\mu$, respectively. We aim at studying the expected GL under a distribution-adapted loss function choice, potentially different from the square loss function. Below we will justify this second choice more extensively.

For the *saturated model* the *common* canonical parameter *θ* of the independent random variables *Y*1*,...,Yn* in (4.4) is replaced by *individual* canonical parameters *θi*, 1 ≤ *i* ≤ *n*. These individual canonical parameters are estimated with individual MLEs. The individual MLEs are given by, respectively,

$$
\widehat{\theta}\_{i}^{\mathsf{MLE}} = (\kappa')^{-1} \left( Y\_{i} \right) = h \left( Y\_{i} \right) \qquad \text{and} \qquad \widehat{\mu}\_{i}^{\mathsf{MLE}} = Y\_{i} \in \overline{\mathcal{M}},
$$

the latter always exists because of the strict convexity and steepness of $\kappa$. Since the MLE $\widehat{\mu}\_i^{\text{MLE}} = Y\_i$ maximizes the log-likelihood, we obtain for any $\mu \in \mathcal{M}$ the inequality

$$\begin{split} 0 &\le 2\left(\log f\left(Y\_{i}; h\left(Y\_{i}\right), v\_{i}/\varphi\right) - \log f\left(Y\_{i}; h\left(\mu\right), v\_{i}/\varphi\right)\right) \\ &= 2\frac{v\_{i}}{\varphi}\left(Y\_{i}h\left(Y\_{i}\right) - \kappa\left(h\left(Y\_{i}\right)\right) - Y\_{i}h\left(\mu\right) + \kappa\left(h\left(\mu\right)\right)\right) \\ &= \frac{v\_{i}}{\varphi}\mathfrak{d}\left(Y\_{i}, \mu\right). \end{split} \tag{4.7}$$

The function $(y, \mu) \mapsto \mathfrak{d}(y, \mu) \ge 0$ is the unit deviance introduced in (2.25), extended to $\mathfrak{C}$, and it is zero if and only if $y = \mu$, see Lemma 2.22. The latter is also an immediate consequence of the fact that the MLE is unique within EDFs.

*Remark 4.4* The unit deviance $\mathfrak{d}(y, \mu)$ has only been considered on $\mathring{\mathfrak{C}} \times \mathcal{M}$ in (2.25). Steepness of the cumulant function $\kappa$ implies $\mathring{\mathfrak{C}} = \mathcal{M}$, see Theorem 2.19, and in the absolutely continuous EDF case we always have $Y\_i \in \mathcal{M}$, a.s., which makes (4.7) well-defined for all observations $Y\_i$, a.s. In the discrete or the mixed EDF case, an observation $Y\_i$ can be at the boundary of $\mathcal{M}$. In that case (4.7) must be calculated from

$$\mathfrak{d}\left(Y\_{i},\mu\right) = 2\left(\sup\_{\widetilde{\theta}\in\Theta} \left[Y\_{i}\widetilde{\theta} - \kappa\left(\widetilde{\theta}\right)\right] - Y\_{i}h\left(\mu\right) + \kappa\left(h\left(\mu\right)\right)\right). \tag{4.8}$$

This applies, e.g., to the Poisson or Bernoulli cases for the observation $Y\_i = 0$; in these cases we obtain the unit deviances $2\mu$ and $-2\log(1-\mu)$, respectively.
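These boundary cases can be coded directly from (4.8); the following sketch (the function names and interfaces are our own) handles the interior and the $y = 0$ boundary separately:

```python
import math

def poisson_unit_deviance(y, mu):
    """Poisson unit deviance; y = 0 is the boundary case of (4.8), giving 2*mu."""
    if y == 0:
        return 2.0 * mu
    return 2.0 * (y * math.log(y / mu) - y + mu)

def bernoulli_unit_deviance(y, mu):
    """Bernoulli unit deviance for y in {0, 1}; y = 0 gives -2*log(1 - mu)."""
    return -2.0 * math.log(1.0 - mu) if y == 0 else -2.0 * math.log(mu)

print(poisson_unit_deviance(0, 0.3), bernoulli_unit_deviance(0, 0.3))
```

Note that the Poisson unit deviance vanishes for $y = \mu$, in line with Lemma 2.22.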

The previous considerations (4.7)–(4.8) have studied a single observation $Y\_i$ of $\boldsymbol{Y}\_n$. Aggregating over all observations in $\boldsymbol{Y}\_n$ (and additionally using the independence between the individual components of $\boldsymbol{Y}\_n$) we arrive at the so-called *deviance loss function*

$$\mathfrak{D}(\mathbf{Y}\_n, \mu) \stackrel{\text{def.}}{=} \frac{1}{n} \sum\_{i=1}^n \frac{v\_i}{\varphi} \mathfrak{d}\left(Y\_i, \mu\right) \tag{4.9}$$

$$=\frac{2}{n}\sum\_{i=1}^{n}\frac{v\_i}{\varphi}\left(Y\_ih\left(Y\_i\right)-\kappa\left(h\left(Y\_i\right)\right)-Y\_ih\left(\mu\right)+\kappa\left(h\left(\mu\right)\right)\right) \ge 0.$$

The deviance loss function $\mathfrak{D}(\boldsymbol{Y}\_n, \mu)$ subtracts twice the log-likelihood $\ell\_{\boldsymbol{Y}\_n}(\mu)$ from twice the log-likelihood of the saturated model (scaled by $1/n$). Thus, it introduces a sign flip compared to (4.5). This immediately gives us the following corollary.

**Corollary 4.5 (Deviance Loss Function)** *The MLE problem* (4.5) *is equivalent to solving*

$$
\widehat{\mu}^{MLE} = \underset{\widetilde{\mu} \in \overline{\mathcal{M}}}{\text{arg}\max} \,\ell\_{\text{Y}\_n}(\widetilde{\mu}) = \underset{\widetilde{\mu} \in \overline{\mathcal{M}}}{\text{arg}\min} \,\mathfrak{D}(\mathbf{Y}\_n, \widetilde{\mu}).\tag{4.10}
$$
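As a quick numerical sanity check of this equivalence (an illustrative sketch with simulated data, not taken from the text): in the gamma case, minimizing the deviance loss (4.9) over a fine grid returns the exposure-weighted mean, which is the MLE under a common canonical parameter.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
v = rng.uniform(0.5, 2.0, size=n)               # weights v_i, dispersion phi = 1
Y = rng.gamma(shape=2.0, scale=1.5, size=n)     # strictly positive observations

def gamma_unit_deviance(y, mu):
    return 2.0 * ((y - mu) / mu - np.log(y / mu))

def deviance_loss(mu):                          # D(Y_n, mu) of (4.9) with phi = 1
    return np.mean(v * gamma_unit_deviance(Y, mu))

grid = np.linspace(Y.min(), Y.max(), 20_001)
mu_hat = grid[np.argmin([deviance_loss(m) for m in grid])]
weighted_mean = np.sum(v * Y) / np.sum(v)
print(mu_hat, weighted_mean)                    # agree up to the grid resolution
```

The agreement follows from setting the derivative of the deviance loss w.r.t. $\mu$ to zero, which yields $\sum\_i v\_i (Y\_i - \mu) = 0$.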

#### *Remarks 4.6*


The MLE problem (4.10) can also be read in terms of the KL divergence, that is,

$$\widehat{\theta}^{\text{MLE}} = \operatorname\*{arg\,min}\_{\widetilde{\theta} \in \Theta} \frac{1}{n} \sum\_{i=1}^{n} D\_{\text{KL}} \Big( f(\cdot; h(Y\_i), v\_i/\varphi) \Big| \Big| f(\cdot; \widetilde{\theta}, v\_i/\varphi) \Big),$$

by finding an optimal parameter $\widehat{\theta}^{\text{MLE}}$ somewhere 'in the middle' of the observation-wise MLEs $\widehat{\theta}\_1^{\text{MLE}} = h(Y\_1), \ldots, \widehat{\theta}\_n^{\text{MLE}} = h(Y\_n)$. This then provides us with, see (2.27),

$$\prod\_{i=1}^{n} f\left(Y\_{i}; \widetilde{\theta}, v\_{i}/\varphi\right) = \left[\prod\_{i=1}^{n} f\left(Y\_{i}; h\left(Y\_{i}\right), v\_{i}/\varphi\right)\right] e^{-\frac{1}{2}\sum\_{i=1}^{n} \frac{v\_{i}}{\varphi} \mathfrak{d}\left(Y\_{i}, \kappa'\left(\widetilde{\theta}\right)\right)} \quad \text{(4.11)}$$

$$\propto \exp\left\{-\sum\_{i=1}^{n} D\_{\text{KL}}\left(f\left(\cdot; h(Y\_{i}), v\_{i}/\varphi\right) \Big| \Big| f\left(\cdot; \widetilde{\theta}, v\_{i}/\varphi\right)\right)\right\},$$

where $\propto$ highlights that we drop all terms that do not involve $\widetilde{\theta}$. This describes the change in the joint likelihood as the canonical parameter $\widetilde{\theta}$ varies over its domain $\Theta$. The first line of (4.11) is in the spirit of minimizing a weighted square loss, but the Gaussian square is replaced by the unit deviance $\mathfrak{d}$. The second line of (4.11) is in the spirit of the information geometry considered in Sect. 2.3, where we try to find a canonical parameter $\widetilde{\theta}$ that has a small KL divergence to the $n$ individual models parametrized by $h(Y\_1), \ldots, h(Y\_n)$; thus, the MLE $\widehat{\theta}^{\text{MLE}}$ provides an optimal balance over the entire set of (independent) observations $Y\_1, \ldots, Y\_n$ w.r.t. the KL divergence.


Pearson's $\chi^2$-statistic provides an alternative measure of discrepancy; for a single observation $y$ and mean $\mu$ it is given by

$$X^2(y, \mu) = \frac{(y - \mu)^2}{V(\mu)},\tag{4.12}$$

where $\mu \mapsto V(\mu)$ is the variance function of the chosen EDF. Similarly to the deviance loss function (4.9), we can aggregate these Pearson's $\chi^2$-statistics $X^2(Y\_i, \mu)$ over all observations $Y\_i$ in $\boldsymbol{Y}\_n$ to obtain a second overall measure of discrepancy. In the Gaussian case the deviance loss and Pearson's $\chi^2$-statistic coincide and have a $\chi^2$-distribution; for other distributions asymptotic results are available.

In the non-Gaussian case, (4.12) is not always robust. For instance, if we work in the Poisson model, we have variance function $V(\mu) = \mu$. Our examples below will have low claim frequencies, which implies that $\mu$ will be small. The appearance of a small $\mu$ in the denominator of (4.12) implies that Pearson's $\chi^2$-statistic is not very robust in low-frequency applications, in particular, if we need to estimate this $\mu$ from $\boldsymbol{Y}\_n$. Therefore, we refrain from using (4.12).
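A toy computation (with made-up values, not the book's numerics) makes this sensitivity concrete: for a single claim $y = 1$, Pearson's statistic (4.12) grows like $1/\mu$ as the estimated frequency $\mu$ shrinks, while the Poisson unit deviance only grows like $2\log(1/\mu)$.

```python
import math

y = 1.0
for mu in (0.1, 0.01, 0.001):
    pearson = (y - mu) ** 2 / mu                       # V(mu) = mu in the Poisson case
    deviance = 2.0 * (y * math.log(y / mu) - y + mu)   # Poisson unit deviance
    print(f"mu={mu:6.3f}   Pearson X^2={pearson:8.2f}   unit deviance={deviance:6.2f}")
```

A tenfold misestimation of a small frequency thus inflates Pearson's statistic by roughly a factor of ten, whereas the deviance changes only moderately.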

Naturally, in analogy to Theorem 4.1 and derivation (4.6), the above considerations motivate us to consider expected GLs under unit deviances within the EDF. We use the decision rule $\widehat{\mu}^{\text{MLE}}(\boldsymbol{Y}\_n) \in \mathbb{A} = \overline{\mathcal{M}}$ to predict a new observation $Y$.

The expected *deviance GL* is defined and given by

$$\begin{aligned} &\mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y,\widehat{\mu}^{\text{MLE}}(\boldsymbol{Y}\_{n})\right)\right] \\ &=\mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y,\mu\right)\right]+2\,\mathbb{E}\_{\theta}\left[Yh(\mu)-\kappa\left(h(\mu)\right)-Yh(\widehat{\mu}^{\text{MLE}}(\boldsymbol{Y}\_{n}))+\kappa\left(h(\widehat{\mu}^{\text{MLE}}(\boldsymbol{Y}\_{n}))\right)\right] \\ &=\mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y,\mu\right)\right]+\mathcal{E}\left(\mu,\widehat{\mu}^{\text{MLE}}(\boldsymbol{Y}\_{n})\right), \end{aligned} \tag{4.13}$$

the last identity uses independence between *Y<sup>n</sup>* and *Y* , and with *estimation risk function*

$$\mathcal{E}\left(\mu, \widehat{\mu}^{\text{MLE}}(Y\_n)\right) = \mathbb{E}\_{\theta}\left[\mathfrak{d}\left(\mu, \widehat{\mu}^{\text{MLE}}(Y\_n)\right)\right] > 0,\tag{4.14}$$

we use the steepness of the cumulant function, $\mathfrak{C} = \text{conv}(\mathfrak{T}) = \overline{\mathcal{M}}$, and Lemma 2.22 for the strict positivity of the estimation risk function. Thus, for the estimation risk function $\mathcal{E}$ we replace $Y$ by $\mu$ in the unit deviance, and the expectation $\mathbb{E}\_{\theta}$ is only over the observations $\boldsymbol{Y}\_n$. This looks like a very convincing generalization of the MSEP; however, one needs to ensure that all terms in (4.13) exist.

**Theorem 4.7 (Expected Deviance Generalization Loss)** *Assume that $\boldsymbol{Y}\_n$ and $Y$ are independent and belong to the same linear EDF having the same canonical parameter $\theta \in \mathring{\Theta}$ and having a strictly convex and steep cumulant function $\kappa$. Choose a predictor $\boldsymbol{Y}\_n \mapsto A(\boldsymbol{Y}\_n) \in \mathbb{A} = \overline{\mathcal{M}}$ and assume that all expectations in the following formula exist. The expected deviance GL of the predictor $A$ used to predict $Y$ is given by*

$$\mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, A(Y\_n) \right) \right] \\ = \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \mu \right) \right] + \mathcal{E} \left( \mu, A(Y\_n) \right) \\ \geq \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \mu \right) \right].$$

#### *Remarks 4.8*


By the independence of $Y$ and $\boldsymbol{Y}\_n$, the decomposition of Theorem 4.7 also holds conditionally on the observations: a.s.,

$$\mathbb{E}\_{\theta} \left[ \mathfrak{d}\left( Y, A(Y\_n) \right) | \, Y\_n \right] = \mathbb{E}\_{\theta} \left[ \mathfrak{d}\left( Y, \mu\right) | \, Y\_n \right] + \mathfrak{d}(\mu, A(Y\_n)) \tag{4.15}$$

$$\geq \mathbb{E}\_{\theta} \left[ \mathfrak{d}\left( Y, \mu\right) \right].$$

Thus, here we directly compare *A(Yn)* to the true parameter *μ*.

*Example 4.9 (Estimation Risk Function in the Gaussian Case)* We consider the Gaussian case with cumulant function $\kappa(\theta) = \theta^2/2$ and canonical link $h(\mu) = \mu$. In the Gaussian case, the estimation risk function of a square-integrable predictor $A(\boldsymbol{Y}\_n)$ is given by

$$\begin{split} \mathcal{E}\left(\mu, A(Y\_n)\right) &= \mathbb{E}\_{\theta}\left[\mathfrak{d}\left(\mu, A(Y\_n)\right)\right] \\ &= 2\Big(\mu h(\mu) - \kappa\left(h(\mu)\right) - \mu \mathbb{E}\_{\theta}\left[h(A(Y\_n))\right] + \mathbb{E}\_{\theta}\left[\kappa\left(h(A(Y\_n))\right)\right]\Big) \\ &= \mu^2 - 2\mu \mathbb{E}\_{\theta}\left[A(Y\_n)\right] + \mathbb{E}\_{\theta}\left[\left(A(Y\_n)\right)^2\right] \\ &= \left(\mu - \mathbb{E}\_{\theta}\left[A(Y\_n)\right]\right)^2 + \text{Var}\_{\theta}(A(Y\_n)). \end{split}$$

These are exactly the squared bias and the estimation variance, see (4.1). Thus, in the Gaussian case, the MSEP and the expected deviance GL coincide. Moreover, adding a deterministic bias $c \in \mathbb{R}$ to $A(\boldsymbol{Y}\_n)$ increases the estimation risk function, provided that $A(\boldsymbol{Y}\_n)$ is unbiased for $\mu$. We emphasize the latter as this is an important property to have, and we refer to the next Example 4.10 for an example where this property fails to hold. ∎

*Example 4.10 (Estimation Risk Function in the Poisson Case)* We consider the Poisson case with cumulant function $\kappa(\theta) = e^{\theta}$ and canonical link $h(\mu) = \log \mu$. The estimation risk function is given by (subject to existence)

$$\mathcal{E}\left(\mu, A(Y\_n)\right) = 2\left(\mu \log(\mu) - \mu - \mu \mathbb{E}\_{\theta}\left[\log(A(Y\_n))\right] + \mathbb{E}\_{\theta}\left[A(Y\_n)\right]\right). \tag{4.16}$$

Assume that the decision rule $A(\boldsymbol{Y}\_n)$ is non-deterministic and unbiased for $\mu$. Using Jensen's inequality, these assumptions imply for the estimation risk function

$$\mathcal{E}\left(\mu, A(Y\_n)\right) = 2\mu \Big(\log(\mu) - \mathbb{E}\_{\theta} \left[\log(A(Y\_n))\right] \Big) > 0.$$

We now add a small deterministic bias $c \in \mathbb{R}$ to the unbiased estimator $A(\boldsymbol{Y}\_n)$ for $\mu$. This gives us the estimation risk function, see (4.16) and subject to existence,

$$\mathcal{E}\left(\mu, A(Y\_n) + c\right) = 2\left(\mu \log(\mu) - \mu \mathbb{E}\_{\theta} \left[\log(A(Y\_n) + c)\right] + c\right).$$

Consider the derivative w.r.t. the bias $c$ at $0$; we use Jensen's inequality on the last line:

$$\frac{\partial}{\partial c} \mathcal{E}\left(\mu, A(\mathbf{Y}\_n) + c\right)\Big|\_{c=0} = 2\left(-\mu \mathbb{E}\_{\theta} \left[\frac{1}{A(\mathbf{Y}\_n) + c}\right] + 1\right)\Big|\_{c=0}$$

$$= -2\mu \mathbb{E}\_{\theta} \left[\frac{1}{A(\mathbf{Y}\_n)}\right] + 2$$

$$ < -2\mu \frac{1}{\mathbb{E}\_{\theta}\left[A(\mathbf{Y}\_n)\right]} + 2 = 0. \tag{4.17}$$

Thus, the estimation risk becomes smaller if we add a small positive bias to the (non-deterministic) unbiased predictor $A(\boldsymbol{Y}\_n)$. This issue has been raised in Denuit et al. [97]. Of course, this is a very unfavorable property, and it is rather different from the Gaussian case in Example 4.9. It is essentially driven by the fact that parameter estimation is based on a finite sample, which implies a strict inequality in (4.17) for the finite sample estimate $A(\boldsymbol{Y}\_n)$. A conclusion of this example is that if we use expected deviance GLs for forecast evaluation we need to insist on having unbiased predictors. This will become especially important for more complex regression models, see Sect. 7.4.2, below.

More generally, one can prove this result of a smaller estimation risk function for a small positive bias for any EDF member with power variance function $V(\mu) = \mu^p$ with $p \ge 1$, see also (4.18) below. The proof uses the Fortuin–Kasteleyn–Ginibre (FKG) inequality [133] providing $\mathbb{E}\_{\theta}[A(\boldsymbol{Y}\_n)^{1-p}] < \mathbb{E}\_{\theta}[A(\boldsymbol{Y}\_n)]\,\mathbb{E}\_{\theta}[A(\boldsymbol{Y}\_n)^{-p}] = \mu\,\mathbb{E}\_{\theta}[A(\boldsymbol{Y}\_n)^{-p}]$ to obtain (4.17) for power variance parameters $p \ge 1$. ∎
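A small Monte Carlo sketch of (4.16)–(4.17) (with our own illustrative setup, not the book's numerics): take $A(\boldsymbol{Y}\_n)$ as the unbiased Poisson MLE, i.e., the sample mean, and evaluate the estimation risk after adding a deterministic bias $c$; a small positive $c$ indeed reduces the risk, while a large bias increases it again.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, n, n_sim = 1.0, 50, 100_000

# A(Y_n): sample mean of n i.i.d. Poisson(mu) observations, unbiased for mu
A = rng.poisson(mu, size=(n_sim, n)).mean(axis=1)

def estimation_risk(c):
    # E(mu, A + c) = 2*(mu*log(mu) - mu*E[log(A + c)] + c), see (4.16) with E[A] = mu
    return 2.0 * (mu * np.log(mu) - mu * np.mean(np.log(A + c)) + c)

print(estimation_risk(0.0), estimation_risk(0.02), estimation_risk(0.5))
```

The risk dips below its value at $c = 0$ for a small positive bias and rises again for larger biases, illustrating the strict inequality (4.17) in a finite sample.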

#### *Remarks 4.11 (Conclusion from Examples 4.9 and 4.10 and a Further Remark)*


The next example gives the most important unit deviances in actuarial modeling.

*Example 4.12 (Unit Deviances)* We give the most prominent examples of unit deviances within the single-parameter linear EDF. We recall unit deviance (2.25)

$$\mathfrak{d}(y,\mu) = 2\left(yh(y) - \kappa \left(h(y)\right) - yh(\mu) + \kappa \left(h(\mu)\right)\right) \ge 0.$$

In Sect. 2.2 we have met the examples given in Table 4.1.



**Table 4.1** Unit deviances of selected distributions commonly used in actuarial science

| Distribution | Unit deviance $\mathfrak{d}(y, \mu)$ |
|---|---|
| Gaussian | $(y-\mu)^2$ |
| Poisson | $2\left(y \log(y/\mu) - y + \mu\right)$ |
| Gamma | $2\left((y-\mu)/\mu - \log(y/\mu)\right)$ |
| Inverse Gaussian | $(y-\mu)^2/(\mu^2 y)$ |
| Bernoulli | $-2\left(y \log \mu + (1-y) \log(1-\mu)\right)$ |

If we focus on Tweedie's distributions having power variance functions $V(\mu) = \mu^p$, see Table 2.1, we get a unified expression for the unit deviances for $p \in \{0\} \cup (1,2) \cup (2,\infty)$

$$\mathfrak{d}(\mathbf{y},\mu) = 2\left(\mathbf{y}\frac{\mathbf{y}^{1-p} - \mu^{1-p}}{1-p} - \frac{\mathbf{y}^{2-p} - \mu^{2-p}}{2-p}\right) \tag{4.18}$$

$$= 2\left(\frac{\mathbf{y}^{2-p}}{(1-p)(2-p)} - \frac{\mathbf{y}\mu^{1-p}}{1-p} + \frac{\mu^{2-p}}{2-p}\right).$$

For the remaining power variance cases we have: *p* = 1 corresponds to the Poisson case, *p* = 2 gives the gamma case, the cases *p <* 0 do not have a steep cumulant function, and, moreover, there are no EDF models for *p* ∈ *(*0*,* 1*)*, see Theorem 2.18.
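A compact sketch of (4.18) including its Poisson ($p = 1$) and gamma ($p = 2$) limit cases; the function name and interface are our own, and the $p = 0$ case recovers the Gaussian square loss $(y-\mu)^2$:

```python
import math

def tweedie_unit_deviance(y, mu, p):
    """Unit deviance (4.18) for power variance function V(mu) = mu^p; y, mu > 0."""
    if p == 1:   # Poisson limit of (4.18)
        return 2.0 * (y * math.log(y / mu) - y + mu)
    if p == 2:   # gamma limit of (4.18)
        return 2.0 * ((y - mu) / mu - math.log(y / mu))
    return 2.0 * (y * (y ** (1 - p) - mu ** (1 - p)) / (1 - p)
                  - (y ** (2 - p) - mu ** (2 - p)) / (2 - p))

# p = 0 gives the Gaussian square loss
print(tweedie_unit_deviance(3.0, 2.0, 0))
```

The formula varies continuously in $p$ near the excluded values, which is why the Poisson and gamma cases appear as limits.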

The unit deviance in the Bernoulli case is also called *binary cross-entropy*. This binary cross-entropy has a categorical generalization, called *multi-class cross-entropy*. Assume we have a categorical EF with levels $\{1, \ldots, k+1\}$ and corresponding probabilities $p\_1, \ldots, p\_{k+1} \in (0,1)$ summing up to 1, see Sect. 2.1.4. We denote by $\boldsymbol{Y} = (\mathbb{1}\_{\{Y=1\}}, \ldots, \mathbb{1}\_{\{Y=k+1\}})^{\top} \in \mathbb{R}^{k+1}$ the indicator variable that shows which level the categorical random variable $Y$ takes; $\boldsymbol{Y}$ is called the one-hot encoding of the categorical random variable $Y$. Assume $\boldsymbol{y}$ is a realization of $\boldsymbol{Y}$ and set $\boldsymbol{\mu} = \boldsymbol{p} = (p\_1, \ldots, p\_{k+1})^{\top}$. The categorical (multi-class) cross-entropy loss function is given by

$$\mathfrak{d}(\mathfrak{y},\mu) = \mathfrak{d}(\mathfrak{y},\mathfrak{p}) = -2\sum\_{j=1}^{k+1} \mathfrak{y}\_j \log p\_j \ge 0. \tag{4.19}$$

This cross-entropy is closely related to the KL divergence between two categorical distributions *p* and *q* on {1*,...,k* +1}. The KL divergence from *p* to *q* is given by

$$D\_{\mathrm{KL}}(q||p) = \sum\_{j=1}^{k+1} q\_j \log\left(\frac{q\_j}{p\_j}\right) = \sum\_{j=1}^{k+1} q\_j \log q\_j - \sum\_{j=1}^{k+1} q\_j \log p\_j.$$

If we replace the true (but unknown) distribution $q$ by the observation $\boldsymbol{Y} = \boldsymbol{y}$ we obtain the unit deviance (4.19) (up to the scaling by 2), and the MLE is obtained by minimizing this KL divergence, see also Example 3.10. ∎
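As a sketch with made-up numbers, the identity between (4.19) and twice the KL divergence is immediate: for a one-hot observation the entropy term $\sum\_j y\_j \log y\_j$ vanishes.

```python
import math

y = [0, 1, 0]                    # one-hot observation: Y took level 2 (k + 1 = 3 levels)
p = [0.2, 0.5, 0.3]              # candidate categorical probabilities

cross_entropy = -2.0 * sum(yj * math.log(pj) for yj, pj in zip(y, p))   # (4.19)
kl = sum(yj * math.log(yj / pj) for yj, pj in zip(y, p) if yj > 0)      # D_KL(y || p)
print(cross_entropy, 2.0 * kl)   # the two agree
```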

**Outlook 4.13** In the regression modeling below, each response $Y\_i$ will have its own mean parameter $\mu\_i = \mu(\boldsymbol{\beta}, \boldsymbol{x}\_i)$, which is a function of its covariate information $\boldsymbol{x}\_i$, where $\boldsymbol{\beta}$ denotes a regression parameter to be estimated with MLE. In that case, we modify the deviance loss function (4.9) to

$$\boldsymbol{\beta} \mapsto \mathfrak{D}(Y\_n, \boldsymbol{\beta}) = \frac{1}{n} \sum\_{i=1}^n \frac{v\_i}{\varphi} \mathfrak{d}\left(Y\_i, \mu\_i\right) = \frac{1}{n} \sum\_{i=1}^n \frac{v\_i}{\varphi} \mathfrak{d}\left(Y\_i, \mu(\boldsymbol{\beta}, \boldsymbol{x}\_i)\right), \qquad (4.20)$$

and the MLE of *β* can be found by solving

$$\widehat{\boldsymbol{\beta}}^{\text{MLE}} = \underset{\boldsymbol{\beta}}{\text{arg min }} \mathfrak{D}(Y\_n, \boldsymbol{\beta}).\tag{4.21}$$

If *Y* is a new response with covariate information *x* and following the same EDF as *Yn*, we will evaluate the corresponding expected scaled deviance GL given by

$$\mathbb{E}\_{\boldsymbol{\beta}}\left[\frac{v}{\varphi}\mathfrak{d}\left(Y,\mu(\widehat{\boldsymbol{\beta}}^{\text{MLE}},\boldsymbol{x})\right)\right],\tag{4.22}$$

where $\mathbb{E}\_{\boldsymbol{\beta}}$ is the expectation under the true regression parameter $\boldsymbol{\beta}$ for $\boldsymbol{Y}\_n$ and $Y$. This will be discussed in Sect. 5.1.7, below. If we interpret $(Y, \boldsymbol{x}, v)$ as a random vector describing a randomly selected insurance policy from our portfolio, being independent of $\boldsymbol{Y}\_n$ (and the corresponding covariate information $\boldsymbol{x}\_i$, $1 \le i \le n$), then $\widehat{\boldsymbol{\beta}}^{\text{MLE}}$ will be independent of $(Y, \boldsymbol{x}, v)$. Nevertheless, the predictor $\mu(\widehat{\boldsymbol{\beta}}^{\text{MLE}}, \boldsymbol{x})$ will introduce dependence between the chosen decision rule and $Y$ through $\boldsymbol{x}$, and we no longer receive the split of the expected deviance GL as stated in Theorem 4.7; for a related discussion we also refer to Remarks 7.17, below.

If we interpret $(Y, \boldsymbol{x}, v)$ as a randomly selected insurance policy, then the expected GL (4.22) is evaluated under the joint (portfolio) distribution of $(Y, \boldsymbol{x}, v)$, and the deviance loss $\mathfrak{D}(\boldsymbol{Y}\_n, \widehat{\boldsymbol{\beta}}^{\text{MLE}})$ is an (in-sample) empirical version of (4.22). ∎

## *4.1.3 A Decision-Theoretic Approach to Forecast Evaluation*

We present an excursion to a decision-theoretic approach to forecast evaluation. This excursion gives the theoretical foundation to the unit deviance considerations from above. This section follows Gneiting [162], Krüger–Ziegel [227] and Denuit et al. [97], and we refrain from giving complete proofs in this section. Forecast evaluation should involve consistent loss/scoring functions and proper scoring rules to encourage the forecaster to make careful assessments and honest forecasts. Consistent loss functions are also a necessary tool to obtain consistency of M-estimators, we refer to Remarks 3.26.

#### **Consistency and Proper Scoring Rules**

Denote by $\mathfrak{C} \subseteq \mathbb{R}$ the convex closure of the support of a real-valued random variable $Y$, and let the action space be $\mathbb{A} = \mathfrak{C}$, see also (3.1). Predictions are evaluated in terms of a loss/scoring function

$$L: \mathfrak{C} \times \mathbb{A} \to \mathbb{R}\_{+}, \qquad (\mathbf{y}, a) \mapsto L(\mathbf{y}, a) \ge 0. \tag{4.23}$$

*Remark 4.14* In (4.23) we assume that the loss function *L* is bounded below by zero. This can be an advantage in applications because it gives a calibration to the loss function. In general, this lower bound is not a necessary condition for forecast evaluation. If we drop this lower bound property, we rather call *L* (only) a scoring function. For instance, the log-likelihood log*(f (y, a))* in (3.27) plays the role of a scoring function.

The forecaster can take the position of minimizing the expected loss to choose her/his action rule. That is, subject to existence, an optimal action w.r.t. $L$ is obtained by

$$\widehat{a} = \widehat{a}(F) = \operatorname\*{arg\,min}\_{a \in \mathbb{A}} \mathbb{E}\_F \left[ L(Y, a) \right] \\ = \operatorname\*{arg\,min}\_{a \in \mathbb{A}} \int\_{\mathfrak{C}} L(\mathfrak{y}, a) dF(\mathfrak{y}). \tag{4.24}$$

In this setup the scoring function $L(y, a)$ describes the loss that the forecaster suffers if she/he uses action $a \in \mathbb{A}$ and observation $y \in \mathfrak{C}$ materializes. Since we do not want to insist on uniqueness in (4.24) we rather think of set-valued functionals in this section, which may provide solutions to problems like (4.24).<sup>1</sup>

We now reverse the line of arguments, and we start from a general set-valued functional. Denote by *F* the family of distribution functions of interest supported on C. Consider the set-valued functional

$$\mathfrak{A} : \mathcal{F} \to \mathcal{P}(\mathbb{A}), \qquad F \mapsto \mathfrak{A}(F) \subset \mathbb{A}, \tag{4.25}$$

that maps each distribution *<sup>F</sup>* <sup>∈</sup> *<sup>F</sup>* to a subset <sup>A</sup>*(F )* of the action space <sup>A</sup> <sup>=</sup> <sup>C</sup>, that is, an element of the power set *<sup>P</sup>(*A*)*. The main question that we want to study in this section is the following: can we find a loss function *L* so that the set-valued

<sup>1</sup> In fact, also for the MLE in Definition 3.4 we should consider a set-valued functional. We have decided to skip this distinction to avoid any kind of complication and to not disturb the flow of reading.

functional A is obtained by a loss minimization (4.24)? This motivates the following definition.

**Definition 4.15 (Strict Consistency)** The loss function $L: \mathfrak{C} \times \mathbb{A} \to \mathbb{R}\_+$ is consistent for the functional $\mathfrak{A}: \mathcal{F} \to \mathcal{P}(\mathbb{A})$ relative to the class $\mathcal{F}$ if

$$\mathbb{E}\_F\left[L(Y,\widehat{a})\right] \le \mathbb{E}\_F\left[L(Y,a)\right],\tag{4.26}$$

for all $F \in \mathcal{F}$, $\widehat{a} \in \mathfrak{A}(F)$ and $a \in \mathbb{A}$. It is strictly consistent if it is consistent and equality in (4.26) implies that $a \in \mathfrak{A}(F)$.

As stated in Theorem 1 of Gneiting [162], a loss function $L$ is consistent for the functional $\mathfrak{A}$ relative to the class $\mathcal{F}$ if and only if, given any $F \in \mathcal{F}$, every $\widehat{a} \in \mathfrak{A}(F)$ is an optimal action under $L$ in the sense of (4.24).

We give an example. Assume we start from the functional *<sup>F</sup>* <sup>→</sup> <sup>A</sup>*(F )* <sup>=</sup> <sup>E</sup>*<sup>F</sup>* [*<sup>Y</sup>* ] that maps each distribution *F* to its expected value. In this case we do not need to consider a set-valued functional because the expected value is a singleton (we assume that *F* only contains distributions with a finite first moment). The question then is whether we can find a loss function *L* such that this mean can be received by a minimization (4.24). This question is answered in Theorem 4.19, below.

Next we relate a consistent loss function *L* to a *proper scoring rule*. A proper scoring rule is a function *<sup>R</sup>* : <sup>C</sup> <sup>×</sup> *<sup>F</sup>* <sup>→</sup> <sup>R</sup> such that

$$\mathbb{E}\_F\left[R(Y,F)\right] \le \mathbb{E}\_F\left[R(Y,G)\right],\tag{4.27}$$

for all $F, G \in \mathcal{F}$, provided that the expectations are well-defined. A scoring rule $R$ analyzes the penalty $R(y, G)$ if the forecaster works with a distribution $G$ and an observation $y$ of $Y \sim F$ materializes. Proper scoring rules have been promoted in Gneiting–Raftery [163] and Gneiting [162]. They are important because they encourage the forecaster to make honest forecasts, i.e., they give the forecaster the incentive to minimize the expected score by following her/his true belief about the true distribution, because only this minimizes the expected penalty in (4.27).

**Theorem 4.16 (Gneiting [162, Theorem 3])** *Assume that L is a consistent loss function for the functional* <sup>A</sup> *relative to the class <sup>F</sup>. For each <sup>F</sup>* <sup>∈</sup> *<sup>F</sup>, let aF* <sup>∈</sup> <sup>A</sup>*(F ). The scoring rule*

$$\mathcal{R}: \mathfrak{C} \times \mathcal{F} \to \mathbb{R}, \qquad (\mathbf{y}, F) \mapsto R(\mathbf{y}, F) = L(\mathbf{y}, a\_F),$$

*is a proper scoring rule.*

*Example 4.17* Consider the unit deviance $\mathfrak{d}(\cdot,\cdot): \mathfrak{C} \times \mathcal{M} \to \mathbb{R}\_+$ for a given EDF $\mathcal{F} = \{F(\cdot; \theta, v/\varphi);\, \theta \in \mathring{\Theta}\}$ with cumulant function $\kappa$. Lemma 2.22 says that under suitable assumptions this unit deviance $\mathfrak{d}(y,\mu)$ is zero if and only if $y = \mu$. We consider the mean functional on $\mathcal{F}$

$$\mathfrak{A} : \mathcal{F} \to \mathbb{A} = \mathcal{M}, \qquad F\_{\theta} = F(\cdot; \theta, \upsilon/\varphi) \mapsto \mathfrak{A}(F\_{\theta}) = \mu(\theta),$$

where $\mu = \mu(\theta) = \kappa'(\theta)$ is the mean of the chosen EDF. Choosing the unit deviance as loss function we obtain for any action $a \in \mathbb{A}$, see (4.13),

$$\mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, a \right) \right] = \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \mu \right) \right] + 2 \, \mathbb{E}\_{\theta} \left[ Y h(\mu) - \kappa \left( h(\mu) \right) - Y h(a) + \kappa \left( h(a) \right) \right]$$

$$= \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \mu \right) \right] + 2 \left( \mu h(\mu) - \kappa \left( h(\mu) \right) - \mu h(a) + \kappa \left( h(a) \right) \right)$$

$$= \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \mu \right) \right] + \mathfrak{d} \left( \mu, a \right).$$

This is minimized for $a = \mu$ and it proves that the unit deviance is strictly consistent for the mean functional $\mathfrak{A}: F\_{\theta} \mapsto \mathfrak{A}(F\_{\theta}) = \mu(\theta)$ relative to the chosen EDF $\mathcal{F} = \{F(\cdot; \theta, v/\varphi);\, \theta \in \mathring{\Theta}\}$. Using Theorem 4.16, the scoring rule

$$\mathcal{R}: \mathfrak{C} \times \mathcal{F} \to \mathbb{R}, \qquad (\mathbf{y}, F\_{\theta}) \mapsto \mathcal{R}(\mathbf{y}, F\_{\theta}) = \mathfrak{d}(\mathbf{y}, \mu(\theta)),$$

is a strictly proper scoring rule, that is,

$$\mathbb{E}\_{\theta} \left[ R(Y, F\_{\theta}) \right] = \mathbb{E}\_{\theta} \left[ \mathfrak{d}(Y, \mu(\theta)) \right] \\ \quad < \mathbb{E}\_{\theta} \left[ \mathfrak{d}(Y, \mu(\widetilde{\theta})) \right] = \mathbb{E}\_{\theta} \left[ R(Y, F\_{\widetilde{\theta}}) \right],$$

for any $\widetilde{\theta} \neq \theta$. We conclude from this small example that the unit deviance is a strictly consistent loss function for the mean functional on the chosen EDF, and this provides us with a strictly proper scoring rule. ∎

In the above Example 4.17 we have chosen the mean functional

$$\mathfrak{A} : \mathcal{F} \to \mathbb{A} = \mathcal{M}, \qquad F\_{\theta} = F(\cdot; \theta, \,\upsilon/\varphi) \mapsto \mathfrak{A}(F\_{\theta}) = \mu(\theta),$$

within a given EDF $\mathcal{F} = \{F(\cdot; \theta, v/\varphi);\, \theta \in \mathring{\Theta}\}$. We have seen that


$$\mathbb{E}\_{\theta} \left[ \mathfrak{d}(Y, \mu(\theta)) \right] \quad \text{<} \quad \mathbb{E}\_{\theta} \left[ \mathfrak{d}(Y, \mu(\widetilde{\theta})) \right],$$

for any $\widetilde{\theta} \neq \theta$.

The consideration of the mean functional *<sup>F</sup>* <sup>→</sup> <sup>A</sup>*(F )* <sup>=</sup> <sup>E</sup>*<sup>F</sup>* [*<sup>Y</sup>* ] in Example 4.17 is motivated by the fact that we typically forecast random variables by their means. However, more generally, we may ask the question for which functionals <sup>A</sup> : *<sup>F</sup>* <sup>→</sup> *<sup>P</sup>(*A*)*, relative to a given set of distributions *<sup>F</sup>*, there exists a loss function *<sup>L</sup>* that is strictly consistent.

**Definition 4.18 (Elicitable)** The functional A is elicitable relative to a given set of distributions *<sup>F</sup>* if there exists a loss function *<sup>L</sup>* that is strictly consistent for <sup>A</sup> and *F*.

Above we have seen that the mean functional is elicitable relative to the EDF using the unit deviance loss; expected values relative to *F* with finite second moments are also elicitable using the square loss function. Savage [327] more generally identifies the Bregman divergences as being the only consistent scoring functions for the mean functional; recall that the unit deviance is a special case of a Bregman divergence, see (2.29). We are going to state the corresponding result.

For a general loss function $L$ we make the following (standard) assumptions:

• (L0) $L(y, a) \ge 0$, and $L(y, a) = 0$ if and only if $y = a$;

• (L1) $L(y, a)$ is measurable in $y$ and continuous in $a$;

• (L2) the partial derivative $\partial L(y, a)/\partial a$ exists and is continuous in $a$ whenever $a \neq y$.
This then allows us to cite the following theorem.

**Theorem 4.19 (Gneiting [162, Theorem 7])** *Let F be the class of distributions on an interval* <sup>C</sup> <sup>⊆</sup> <sup>R</sup> *having finite first moments.*

• *Assume the loss function $L: \mathfrak{C} \times \mathbb{A} \to \mathbb{R}$ satisfies (L0)–(L2) for an interval $\mathfrak{C} = \mathbb{A} \subseteq \mathbb{R}$. $L$ is consistent for the mean functional relative to the class $\mathcal{F}$ of compactly supported distributions on $\mathfrak{C}$ if and only if the loss function $L$ is of Bregman divergence form*

$$D\_{\psi}(y, a) = \psi(y) - \psi(a) - \psi'(a)(y - a),$$

*for a convex function $\psi$ with (sub-)gradient $\psi'$ on $C$.*

• *If $\psi$ is strictly convex on $C$, then the Bregman divergence $D\_\psi$ is strictly consistent for the mean functional relative to the class of distributions $F \in \mathcal{F}$ on $C$ for which both $\mathbb{E}\_F[Y]$ and $\mathbb{E}\_F[\psi(Y)]$ exist and are finite.*

Theorem 4.19 tells us that Bregman divergences are the only consistent loss functions for the mean functional (under some additional assumptions). Consider the specific choice $\psi(a) = a^2/2$, which is a strictly convex function. For this choice, the Bregman divergence is the square loss function $D\_\psi(y, a) = (y - a)^2/2$, which is strictly consistent for the mean functional relative to the class $\mathcal{F} \subset L^2(\mathbb{P})$. We remark that quantiles are also elicitable; the corresponding result is stated in Theorem 5.33, below.
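These claims are easy to verify numerically. The following sketch (all helper names are hypothetical, and the gamma sample is chosen purely for illustration) checks that $\psi(a) = a^2/2$ reproduces the square loss and that the expected Bregman loss is minimized at the mean:

```python
import numpy as np

rng = np.random.default_rng(0)


def bregman(psi, psi_prime, y, a):
    """Bregman divergence D_psi(y, a) = psi(y) - psi(a) - psi'(a) (y - a)."""
    return psi(y) - psi(a) - psi_prime(a) * (y - a)


# For psi(a) = a^2 / 2 the Bregman divergence is the square loss (y - a)^2 / 2.
psi = lambda a: a ** 2 / 2.0
psi_prime = lambda a: a

# Strict consistency for the mean: a -> E[D_psi(Y, a)] is minimized at E[Y].
# We check this on a grid for a gamma sample with mean 3.
sample = rng.gamma(shape=2.0, scale=1.5, size=100_000)
grid = np.linspace(1.0, 5.0, 401)
expected_loss = [bregman(psi, psi_prime, sample, a).mean() for a in grid]
a_star = grid[np.argmin(expected_loss)]
print(a_star)  # grid point closest to the empirical mean of the sample
```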

The second bullet point of Theorem 4.19 immediately implies that the unit deviance $\mathfrak{d}(\cdot,\cdot)$ is a strictly consistent loss function for the mean functional within the chosen EDF, see also (2.29) and Example 4.17. In particular, for $\theta \in \mathring{\Theta}$ we have

$$\mu = \mu(\theta) = \operatorname\*{arg\,min}\_{a \in \mathcal{M}} \mathbb{E}\_{\theta} \left[ \mathfrak{d}(Y, a) \right]. \tag{4.28}$$


Explicit evaluation of (4.28) requires that the true distribution $F\_\theta$ of $Y$ is known. Since, typically, this is not the case, we need to evaluate it empirically. Assume that the random variables $Y\_i$ are independent and $F\_\theta$ distributed, with $F\_\theta$ belonging to the fixed EDF providing the corresponding unit deviance $\mathfrak{d}$. Then, the objective function in (4.28) is approximated by, a.s.,

$$\mathfrak{D}(Y\_n, a) = \frac{1}{n} \sum\_{i=1}^n \frac{v\_i}{\varphi}\, \mathfrak{d}(Y\_i, a) \;\to\; \mathbb{E}\_{\theta} \left[ \frac{v}{\varphi}\, \mathfrak{d}(Y, a) \right] \qquad \text{as } n \to \infty. \tag{4.29}$$

The convergence statement follows from the strong law of large numbers applied to the i.i.d. random variables $(Y\_i, v\_i)$, $i \ge 1$, provided that the right-hand side of (4.29) exists. Thus, the deviance loss function (4.9) is an empirical version of the expected deviance loss function, and this approach is successful if we can exchange the 'argmin' operator of (4.28) and the limit $n \to \infty$ in (4.29). This closes the circle and brings us back to the M-estimator considered in Remarks 3.26 and 3.29, which also links forecast evaluation and M-estimation.
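As a sketch of (4.28)–(4.29) in the Poisson case (simulated data, hypothetical helper names): minimizing the empirical deviance loss over a grid of candidate means recovers the exposure-weighted sample mean, i.e., the Poisson MLE of (3.24).

```python
import numpy as np

rng = np.random.default_rng(1)


def poisson_unit_deviance(y, a):
    """Poisson unit deviance d(y, a) = 2 (a - y + y log(y / a)), with 0 log 0 := 0."""
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / a), 0.0)
    return 2.0 * (a - y + ylogy)


# Simulated exposures v_i and claim frequencies Y_i = N_i / v_i (phi = 1).
n, mu_true = 10_000, 0.07
v = rng.uniform(0.5, 1.5, size=n)
N = rng.poisson(mu_true * v)
Y = N / v

# Empirical deviance loss of (4.29): a -> (1/n) sum_i (v_i / phi) d(Y_i, a),
# minimized over a grid of candidate means a.
grid = np.linspace(0.04, 0.10, 601)
losses = [(v * poisson_unit_deviance(Y, a)).mean() for a in grid]
a_hat = grid[np.argmin(losses)]

# The minimizer agrees with the exposure-weighted mean, the Poisson MLE (3.24).
print(a_hat, N.sum() / v.sum())
```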

#### **Forecast Dominance**

A consequence of Theorem 4.19 is that there are infinitely many strictly consistent loss functions for the mean functional and, in principle, we could choose any of these for forecast evaluation. Choosing the unit deviance $\mathfrak{d}$ that matches the distribution $F\_\theta$ of the observations $Y\_n$ and $Y$, respectively, gives us the MLE $\widehat{\mu}^{\rm MLE}$, and we have seen that this MLE is not only unbiased for $\mu = \kappa'(\theta)$, but it also meets the Cramér–Rao information bound. That is, it is UMVU within the data generating model reflected by the true unit deviance $\mathfrak{d}$. This provides us (in the finite sample case) with a natural candidate for $\mathfrak{d}$ in (4.29) and, thus, a canonical proper scoring rule for (out-of-sample) forecast evaluation.

The previous statements have all been made under the assumption that there is no uncertainty about the underlying family of distribution functions that generates $Y$ and $Y\_n$, respectively; uncertainty was limited to the true canonical parameter $\theta$ and the true mean $\mu(\theta)$. This situation changes under model uncertainty. Krüger–Ziegel [227] study the question of having multiple strictly consistent loss functions in the situation where there is no natural candidate choice. Different choices may rank different (finite sample) predictors differently. Assume we have two predictors $\widehat{\mu}\_1$ and $\widehat{\mu}\_2$ for a random variable $Y$. Similarly to the definition of the expected deviance GL, we understand these predictors $\widehat{\mu}\_1$ and $\widehat{\mu}\_2$ as random variables, and we assume that all considered random variables have a finite first moment. Importantly, we do not assume independence between $\widehat{\mu}\_1$, $\widehat{\mu}\_2$ and $Y$; in regression models we typically receive dependence between predictors $\widehat{\mu}$ and random variables $Y$ through the features (covariates) $\boldsymbol{x}$, see also Outlook 4.13. Following Krüger–Ziegel [227] and Ehm et al. [119] we define *forecast dominance* as follows.

**Definition 4.20 (Forecast Dominance)** Predictor $\widehat{\mu}\_1$ dominates predictor $\widehat{\mu}\_2$ if

$$\mathbb{E}\left[D\_{\psi}\left(Y,\widehat{\mu}\_{1}\right)\right] \le \mathbb{E}\left[D\_{\psi}\left(Y,\widehat{\mu}\_{2}\right)\right],$$

for all Bregman divergences $D\_\psi$ with (convex) $\psi$ supported on $\mathfrak{C}$, the latter being the convex closure of the supports of $Y$, $\widehat{\mu}\_1$ and $\widehat{\mu}\_2$.

If we work with a fixed member of the EDF, e.g., the gamma distribution, then we typically study the corresponding expected deviance GL for forecast evaluation in one single model, see Theorem 4.7 and (4.29). This evaluation may involve model risk in the decision making process, and forecast dominance provides a robust selection criterion.

Krüger–Ziegel [227] build on Theorem 1b and Corollary 1b of Ehm et al. [119] to prove the following theorem (which saves us from having to consider all convex functions $\psi$).

**Theorem 4.21 (Theorem 2.1 of Krüger–Ziegel [227])** *Predictor $\widehat{\mu}\_1$ dominates predictor $\widehat{\mu}\_2$ if and only if, for all $\tau \in \mathfrak{C}$,*

$$\mathbb{E}\left[\left(Y-\tau\right)\mathbb{1}\_{\{\widehat{\mu}\_{1}>\tau\}}\right] \geq \mathbb{E}\left[\left(Y-\tau\right)\mathbb{1}\_{\{\widehat{\mu}\_{2}>\tau\}}\right].\tag{4.30}$$
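Criterion (4.30) can be explored by Monte Carlo. The sketch below uses a hypothetical mixed Poisson setup (all names are illustrative) and compares an informed predictor, the conditional mean, against the uninformed overall mean on a grid of thresholds; a small tolerance absorbs simulation noise:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical mixed Poisson portfolio: Y | Lambda ~ Poisson(Lambda),
# with Lambda ~ Gamma having mean 1.
n = 200_000
lam = rng.gamma(shape=2.0, scale=0.5, size=n)
Y = rng.poisson(lam)

mu1 = lam                     # informed predictor: the conditional mean E[Y | Lambda]
mu2 = np.full(n, lam.mean())  # uninformed predictor: the overall mean


def score(mu, tau):
    """Monte Carlo estimate of E[(Y - tau) 1{mu > tau}] from (4.30)."""
    return ((Y - tau) * (mu > tau)).mean()


taus = np.linspace(0.0, 3.0, 31)
dominates = all(score(mu1, t) >= score(mu2, t) - 1e-3 for t in taus)
print(dominates)  # the informed predictor dominates on this grid
```

The dominance here is what Jensen's inequality predicts: for a constant predictor, $\mathbb{E}[(Y-\tau)\mathbb{1}]$ cannot exceed the expected payoff of the conditional mean.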

Denuit et al. [97] argue that in insurance one typically works with Tweedie's distributions having power variance functions $V(\mu) = \mu^p$ with power variance parameters $p \ge 1$. This motivates the following weaker form of forecast dominance.

**Definition 4.22 (Tweedie's Forecast Dominance)** Predictor $\widehat{\mu}\_1$ Tweedie-dominates predictor $\widehat{\mu}\_2$ if

$$\mathbb{E}\left[\mathfrak{d}\_p(Y,\widehat{\mu}\_1)\right] \le \mathbb{E}\left[\mathfrak{d}\_p(Y,\widehat{\mu}\_2)\right],$$

for all Tweedie's unit deviances $\mathfrak{d}\_p$ with power variance parameters $p \ge 1$; we refer to (4.18) for $p \in (1,\infty) \setminus \{2\}$ and to Table 4.1 for the Poisson and gamma cases $p \in \{1, 2\}$.

Recall that Tweedie's unit deviances $\mathfrak{d}\_p$ are a subclass of the Bregman divergences, see (2.29). Define the following function for power variance parameters $p \ge 1$:

$$\mathcal{T}\_p(\mu) = \begin{cases} \log \mu & \text{for } p = 2, \\ \frac{\mu^{2-p}}{2-p} & \text{otherwise.} \end{cases}$$

Denuit et al. [97] prove the following proposition.

**Proposition 4.23 (Proposition 4.1 of Denuit et al. [97])** *Predictor $\widehat{\mu}\_1$ Tweedie-dominates predictor $\widehat{\mu}\_2$ if*

$$\mathbb{E}\left[\mathcal{T}\_p(\widehat{\mu}\_1)\right] \le \mathbb{E}\left[\mathcal{T}\_p(\widehat{\mu}\_2)\right] \qquad \text{for all } p \ge 1,$$

*and*

$$\mathbb{E}\left[Y\mathbb{1}\_{\{\widehat{\mu}\_1 > \tau\}}\right] \ge \mathbb{E}\left[Y\mathbb{1}\_{\{\widehat{\mu}\_2 > \tau\}}\right] \qquad \text{ for all } \tau \in \mathfrak{C}.$$

Theorem 4.21 gives necessary and sufficient conditions for forecast dominance; Proposition 4.23 gives sufficient conditions for the weaker Tweedie's forecast dominance. In Theorem 7.15, below, we give another characterization of forecast dominance in terms of the convex order, under the additional assumption that the predictors are so-called auto-calibrated.
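Definition 4.22 can also be checked directly by Monte Carlo. The sketch below uses a hypothetical mixed gamma setup (strictly positive observations, so that all Tweedie deviances are finite; all names are illustrative) and verifies Tweedie's forecast dominance of the conditional mean over the overall mean on a grid of power variance parameters:

```python
import numpy as np

rng = np.random.default_rng(3)


def tweedie_unit_deviance(y, mu, p):
    """Tweedie unit deviance d_p(y, mu), p >= 1, for strictly positive y."""
    if p == 1:    # Poisson-type deviance
        return 2.0 * (y * np.log(y / mu) - y + mu)
    if p == 2:    # gamma deviance
        return 2.0 * ((y - mu) / mu - np.log(y / mu))
    return 2.0 * (y ** (2 - p) / ((1 - p) * (2 - p))
                  - y * mu ** (1 - p) / (1 - p)
                  + mu ** (2 - p) / (2 - p))


# Hypothetical mixed gamma portfolio: Y | Lambda ~ Gamma with mean Lambda.
n = 100_000
lam = rng.gamma(shape=3.0, scale=1.0 / 3.0, size=n)   # E[Lambda] = 1
Y = rng.gamma(shape=5.0, scale=lam / 5.0)             # E[Y | Lambda] = Lambda

mu1 = lam                   # informed predictor (conditional mean)
mu2 = np.full(n, Y.mean())  # uninformed predictor (overall mean)

for p in [1.0, 1.5, 2.0, 2.5, 3.0]:
    gap = (tweedie_unit_deviance(Y, mu2, p).mean()
           - tweedie_unit_deviance(Y, mu1, p).mean())
    print(p, gap)  # nonnegative gaps: mu1 Tweedie-dominates mu2 here
```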

## **4.2 Cross-Validation**

This section focuses on estimating the expected deviance GL (4.13) in cases where the canonical parameter *θ* is not known. Of course, the same concepts apply to the MSEP. In the remainder of this section we scale the unit deviances with *v/ϕ*, to bring them in line with the deviance loss (4.9).

## *4.2.1 In-Sample and Out-of-Sample Losses*

The general aim in predictive modeling is to predict an unobserved random variable $Y$ as accurately as possible based on past information $Y\_n$. Within the EDF, the predictive performance is evaluated under an empirical version of the expected deviance GL

$$\mathbb{E}\_{\theta} \left[ \frac{v}{\varphi} \mathfrak{d}\left( Y, A(Y\_n) \right) \right] = 2 \mathbb{E}\_{\theta} \left[ \frac{v}{\varphi} \left( Y h(Y) - \kappa \left( h(Y) \right) - Y h(A(Y\_n)) + \kappa \left( h(A(Y\_n)) \right) \right) \right]. \tag{4.31}$$

Here, we no longer assume that *Y* and *A(Yn)* are independent, and in the dependent case Theorem 4.7 does not apply. The reason for dropping the independence assumption is that below we consider regression models of a similar type as in Outlook 4.13. The expected deviance GL (4.31) as such is not directly useful because it cannot be calculated if the true canonical parameter *θ* is not known. Therefore, we are going to explain how it can be estimated empirically.

We start from the expected deviance GL in the EDF applied to the MLE decision rule $\widehat{\mu}^{\rm MLE}(Y\_n)$. It can be rewritten as

$$\mathbb{E}\_{\theta} \left[ \frac{v}{\varphi}\, \mathfrak{d}\left( Y, \widehat{\mu}^{\text{MLE}}(Y\_n) \right) \right] = \int \mathbb{E}\_{\theta} \left[ \frac{v}{\varphi}\, \mathfrak{d}\left( Y, \widehat{\mu}^{\text{MLE}}(Y\_n) \right) \bigg| Y\_n = \mathfrak{y}\_n \right] dP(\mathfrak{y}\_n; \theta), \tag{4.32}$$

where we use the tower property for conditional expectations. In view of (4.32), there are two things to be done:

(1) For given observations $Y\_n = \mathfrak{y}\_n$, we need to estimate the deviance GL, see also (4.15),

$$\mathbb{E}\_{\theta} \left[ \frac{v}{\varphi} \mathfrak{d}\left( Y, \widehat{\mu}^{\text{MLE}}(Y\_n) \right) \bigg| Y\_n = \mathfrak{y}\_n \right] = \mathbb{E}\_{\theta} \left[ \frac{v}{\varphi} \mathfrak{d}\left( Y, \widehat{\mu}^{\text{MLE}}(\mathfrak{y}\_n) \right) \bigg| Y\_n = \mathfrak{y}\_n \right]. \tag{4.33}$$

This is the part that we are going to solve empirically in this section. Typically, we assume that $Y$ and $Y\_n$ are independent; nevertheless, $Y$ and its MLE predictor may still be dependent because we may have a predictor $\widehat{\mu}^{\rm MLE}(Y\_n) = \widehat{\mu}^{\rm MLE}(Y\_n, \boldsymbol{x})$. That is, this predictor often depends on covariate information $\boldsymbol{x}$ that describes $Y$; an example is provided in (4.22) of Outlook 4.13, and this is different from (4.15). In that case, the decision rule $A: \mathcal{Y} \times \mathcal{X} \to \mathbb{A}$ is extended by an additional covariate component $\boldsymbol{x} \in \mathcal{X}$; we refer to Sect. 5.1.1, where $\mathcal{X}$ is introduced and discussed.

(2) We have to find a way to generate more observations $Y\_n$ from $P(\mathfrak{y}\_n; \theta)$ in order to evaluate the outer integral in (4.32) empirically. One way to do so is the bootstrap method that is going to be discussed in Sect. 4.3, below.

We address the first problem of estimating the deviance GL given in (4.33). We do this under the assumption that $Y\_n$ and $Y$ are independent. In order to estimate (4.33) we need observations for $Y$. However, typically, there are no observations available for this random variable because it is only going to be observed in the future. For this reason, one uses past observations for both model fitting and the GL analysis. In order to perform this analysis in a proper way, the general paradigm is to partition the entire data into two *disjoint* data sets, a so-called *learning data set* $\mathcal{L} = \{Y\_1, \ldots, Y\_n\}$ and a *test data set* $\mathcal{T} = \{Y\_1^\dagger, \ldots, Y\_T^\dagger\}$. If we assume that all observations in $\mathcal{L} \cup \mathcal{T}$ are independent, then we receive a suitable observation $Y\_n$ from the learning data set $\mathcal{L}$ that can be used for model fitting. The test sample $\mathcal{T}$ can then play the role of the unobserved random variable $Y$ (by assumption being independent of $Y\_n$). Note that $\mathcal{L}$ is *only* used for model fitting and $\mathcal{T}$ is *only* used for the deviance GL evaluation, see Fig. 4.1.

This setup motivates estimating the mean parameter $\mu$ with the MLE $\widehat{\mu}\_{\mathcal{L}}^{\rm MLE} = \widehat{\mu}^{\rm MLE}(Y\_n)$ from the learning data $\mathcal{L}$ and $Y\_n$, respectively, by minimizing the deviance loss function $\mu \mapsto \mathfrak{D}(Y\_n, \mu)$ on the learning data $\mathcal{L}$, according to Corollary 4.5. Then we use this predictor $\widehat{\mu}\_{\mathcal{L}}^{\rm MLE}$ to empirically evaluate the conditional expectation in (4.33) on $\mathcal{T}$. The perception used is that we *(in-sample) learn a model* on $\mathcal{L}$ and we *out-of-sample test this model* on $\mathcal{T}$ to see how it generalizes to unobserved variables $Y\_t^\dagger$, $1 \le t \le T$, that are of a similar nature as $Y$.

**Fig. 4.1** Partition of entire data into learning data set *L* and test data set *T*

**Definition 4.24 (In-Sample and Out-of-Sample Losses)** The *in-sample deviance loss* on the learning data $\mathcal{L} = \{Y\_1, \ldots, Y\_n\}$ is given by

$$\mathfrak{D}(\mathcal{L}, \widehat{\mu}\_{\mathcal{L}}^{\text{MLE}}) = \frac{2}{n} \sum\_{i=1}^{n} \frac{v\_{i}}{\varphi} \left( Y\_{i} h\left(Y\_{i}\right) - \kappa\left(h\left(Y\_{i}\right)\right) - Y\_{i} h\left(\widehat{\mu}\_{\mathcal{L}}^{\text{MLE}}\right) + \kappa\left(h(\widehat{\mu}\_{\mathcal{L}}^{\text{MLE}})\right) \right),$$

with the MLE $\widehat{\mu}\_{\mathcal{L}}^{\rm MLE} = \widehat{\mu}^{\rm MLE}(Y\_n)$ on $\mathcal{L}$.

The *out-of-sample deviance loss* on the test data $\mathcal{T} = \{Y\_1^\dagger, \ldots, Y\_T^\dagger\}$ of predictor $\widehat{\mu}\_{\mathcal{L}}^{\rm MLE}$ is

$$\mathfrak{D}(\mathcal{T}, \widehat{\mu}\_{\mathcal{L}}^{\text{MLE}}) = \frac{2}{T} \sum\_{t=1}^{T} \frac{v\_t^\dagger}{\varphi} \left( Y\_t^\dagger h\left(Y\_t^\dagger\right) - \kappa \left( h\left(Y\_t^\dagger\right) \right) - Y\_t^\dagger h\left(\widehat{\mu}\_{\mathcal{L}}^{\text{MLE}}\right) + \kappa \left( h(\widehat{\mu}\_{\mathcal{L}}^{\text{MLE}}) \right) \right),$$

where the sum runs over the test sample $\mathcal{T}$ having exposures $v\_1^\dagger, \ldots, v\_T^\dagger > 0$.

For the MLE we minimize the objective function (4.9); therefore, the in-sample deviance loss $\mathfrak{D}(\mathcal{L}, \widehat{\mu}\_{\mathcal{L}}^{\rm MLE}) = \mathfrak{D}(Y\_n, \widehat{\mu}^{\rm MLE}(Y\_n))$ exactly corresponds to the minimal deviance loss (4.9) achieved on the learning data $\mathcal{L}$, i.e., when using the MLE $\widehat{\mu}\_{\mathcal{L}}^{\rm MLE} = \widehat{\mu}^{\rm MLE}(Y\_n)$. We call this *in-sample* because the *same* data $\mathcal{L}$ is used for parameter estimation and deviance loss calculation. Typically, this loss is biased (too optimistic) because it uses the optimal (in-sample) parameter estimate; we also refer to Sect. 4.2.3, below.

The out-of-sample loss $\mathfrak{D}(\mathcal{T}, \widehat{\mu}\_{\mathcal{L}}^{\rm MLE})$ then empirically estimates the inner expectation in (4.32). This is a proper out-of-sample analysis because the test data $\mathcal{T}$ is disjoint from the learning data $\mathcal{L}$ on which the decision rule $\widehat{\mu}\_{\mathcal{L}}^{\rm MLE}$ has been trained. Note that this out-of-sample figure reflects (4.33) in the following sense. We have a portfolio of risks $(Y\_t^\dagger, v\_t^\dagger)$, $1 \le t \le T$, and (4.33) does not only reflect the calculation of the deviance GL of a given risk, but also the random selection of a risk from the portfolio. In this sense, (4.33) is an average over a given portfolio whose description is also included in the probability $\mathbb{P}\_\theta$.
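As a small numerical sketch of Definition 4.24 (simulated homogeneous Poisson data; all names are hypothetical), we fit the MLE on the learning data only and evaluate the deviance losses on both data sets:

```python
import numpy as np

rng = np.random.default_rng(4)


def poisson_deviance_loss(N, v, mu):
    """Poisson deviance loss (2/n) sum_i (v_i mu - N_i - N_i log(v_i mu / N_i))."""
    nlog = np.where(N > 0, N * np.log(v * mu / np.where(N > 0, N, 1.0)), 0.0)
    return 2.0 * np.mean(v * mu - N - nlog)


# Disjoint learning data L and test data T simulated from the same model.
mu_true, n_L, n_T = 0.07, 8_000, 2_000
v_L, v_T = rng.uniform(0.5, 1.5, n_L), rng.uniform(0.5, 1.5, n_T)
N_L, N_T = rng.poisson(mu_true * v_L), rng.poisson(mu_true * v_T)

# Model fitting uses L only: homogeneous Poisson MLE, see (3.24).
mu_hat = N_L.sum() / v_L.sum()

in_sample = poisson_deviance_loss(N_L, v_L, mu_hat)       # D(L, mu_hat)
out_of_sample = poisson_deviance_loss(N_T, v_T, mu_hat)   # D(T, mu_hat)
print(mu_hat, in_sample, out_of_sample)
```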

**Summary 4.25** Definition 4.24 gives the general principle in predictive modeling according to which model learning and the generalization analysis are done. Namely, based on two disjoint and independent data sets $\mathcal{L}$ and $\mathcal{T}$, we perform model calibration on $\mathcal{L}$, and we analyze (conditional) GLs (using out-of-sample losses) on $\mathcal{T}$. For this concept to be useful, the learning data $\mathcal{L}$ and the test data $\mathcal{T}$ have to be sufficiently similar, i.e., ideally come from the same model.

This approach does not estimate the outer expectation in the expected deviance GL (4.32), i.e., it only provides an estimate for the deviance GL, given $Y\_n$, see (4.33).

## *4.2.2 Cross-Validation Techniques*

In many applications one is not in the comfortable situation of having two sufficiently large data sets $\mathcal{L}$ and $\mathcal{T}$ available to support model learning and an out-of-sample generalization analysis. That is, we are usually equipped with only one data set of average size, let us call it $\mathcal{D}$. In order to calculate the objects in Definition 4.24 we could partition this data set (at random) into two data sets and then calculate in-sample and out-of-sample deviance losses on this partition. The disadvantage of this approach is that it is an inefficient use of information if only little data is available. In that case we require (almost) all data for learning. However, we still need a sufficiently large share of data for testing, to receive reliable deviance GL estimates for (4.33). The classical approach in this situation is to use cross-validation for estimating out-of-sample losses. The concept works as follows:


Cross-validation is (only) done for *estimating* the deviance GL of the model learned on all data. That is, for prediction we work with the MLE $\widehat{\mu}\_{\mathcal{L}=\mathcal{D}}^{\rm MLE}$, but the out-of-sample deviance loss is estimated using this data in a different way.

The three most commonly used methods are leave-one-out, *K*-fold and stratified *K*-fold cross-validation. We briefly describe these three cross-validation methods.

#### **Leave-One-Out Cross-Validation**

Denote all available data by $\mathcal{D} = \{Y\_1, \ldots, Y\_n\}$, and assume independence between the components. For leave-one-out (loo) cross-validation we select $1 \le i \le n$ and define the partition $\mathcal{L}\_{(-i)} = \mathcal{D} \setminus \{Y\_i\}$ for the learning data and $\mathcal{T}\_i = \{Y\_i\}$ for the test data. Based on the learning data $\mathcal{L}\_{(-i)}$ we calculate the MLE

$$
\widehat{\mu}^{(-i)} \overset{\text{def.}}{=} \widehat{\mu}^{\text{MLE}}\_{\mathcal{L}\_{(-i)}},
$$

which is based on all data except observation *Yi*. This observation is now used to do an out-of-sample analysis, and averaging this over all 1 ≤ *i* ≤ *n* we receive the *leave-one-out cross-validation loss*

$$\widehat{\mathfrak{D}}^{\text{loo}} = \frac{1}{n} \sum\_{i=1}^{n} \frac{v\_{i}}{\varphi}\, \mathfrak{d}\left(Y\_{i}, \widehat{\mu}^{(-i)}\right) = \frac{1}{n} \sum\_{i=1}^{n} \mathfrak{D}\left(\mathcal{T}\_{i}, \widehat{\mu}^{(-i)}\right) \tag{4.34}$$

$$= \frac{2}{n} \sum\_{i=1}^{n} \frac{v\_{i}}{\varphi} \left(Y\_{i}h\left(Y\_{i}\right) - \kappa\left(h\left(Y\_{i}\right)\right) - Y\_{i}h\left(\widehat{\mu}^{(-i)}\right) + \kappa\left(h\left(\widehat{\mu}^{(-i)}\right)\right)\right),$$

where $\mathfrak{D}(\mathcal{T}\_i, \widehat{\mu}^{(-i)})$ is the (out-of-sample) *cross-validation loss* on $\mathcal{T}\_i = \{Y\_i\}$ using the predictor $\widehat{\mu}^{(-i)}$. This leave-one-out cross-validation loss $\widehat{\mathfrak{D}}^{\rm loo}$ is now used as an estimate for the out-of-sample deviance loss $\mathfrak{D}(\mathcal{T}, \widehat{\mu}\_{\mathcal{L}}^{\rm MLE})$. Leave-one-out cross-validation uses all data $\mathcal{D}$ for learning and testing, namely, the data $\mathcal{D}$ is partitioned into a learning set $\mathcal{L}\_{(-i)}$ for (partial) learning and a test set $\mathcal{T}\_i = \{Y\_i\}$ for an out-of-sample generalization analysis. This is done for all instances $1 \le i \le n$, and the out-of-sample loss is estimated by the resulting average cross-validation loss. This averaging allows us to not only understand (4.34) as a conditional out-of-sample loss in the spirit of Definition 4.24; the outer empirical average in (4.34) also makes it suitable as an expected deviance GL estimate according to (4.32).
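In the homogeneous Poisson model, the leave-one-out MLEs are available in closed form (drop one observation from the sums), so the $n$ refits reduce to vector arithmetic. A sketch with simulated data and $\varphi = 1$ (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)


def poisson_unit_deviance(y, mu):
    ylog = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * (mu - y + ylog)


mu_true, n = 0.07, 5_000
v = rng.uniform(0.5, 1.5, n)
N = rng.poisson(mu_true * v)
Y = N / v

# Closed-form leave-one-out MLEs: drop observation i from the sums in (3.24).
mu_loo = (N.sum() - N) / (v.sum() - v)                  # vector of mu^(-i)
D_loo = np.mean(v * poisson_unit_deviance(Y, mu_loo))   # loo CV loss (4.34)

mu_full = N.sum() / v.sum()
D_in = np.mean(v * poisson_unit_deviance(Y, mu_full))   # in-sample loss
print(D_in, D_loo)  # D_loo >= D_in: each mu^(-i) lies farther from Y_i than mu_full
```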

The variance of this empirical deviance GL estimate is given by (subject to existence)

$$\mathrm{Var}\_{\theta}\left(\widehat{\mathfrak{D}}^{\mathrm{loo}}\right) = \frac{1}{n^2} \sum\_{i=1}^{n} \sum\_{j=1}^{n} \mathrm{Cov}\_{\theta}\left(\frac{v\_{i}}{\varphi}\,\mathfrak{d}\left(Y\_{i},\widehat{\mu}^{(-i)}\right),\frac{v\_{j}}{\varphi}\,\mathfrak{d}\left(Y\_{j},\widehat{\mu}^{(-j)}\right)\right).$$


**Fig. 4.2** Partitions of *K*-fold cross-validation for *K* = 5

For $i \neq j$, the estimators $\widehat{\mu}^{(-i)}$ and $\widehat{\mu}^{(-j)}$ use exactly the same observations on $\mathcal{D} \setminus \{Y\_i, Y\_j\}$; therefore, there are strong correlations between them, and the off-diagonal covariance terms do not vanish. In addition, leave-one-out cross-validation is often computationally not feasible because it requires fitting the model $n$ times, which, for complex models and large insurance portfolios, can be too demanding. We come back to this in Sect. 5.6 where we provide the generalized cross-validation (GCV) loss approximation within generalized linear models (GLMs).

#### *K***-Fold Cross-Validation**

Choose a fixed integer $K \ge 2$ and partition the entire data $\mathcal{D}$ at random into $K$ disjoint subsets (called folds) $\mathcal{L}\_1, \ldots, \mathcal{L}\_K$ of approximately the same size. The learning data for fixed $1 \le k \le K$ is then defined by $\mathcal{L}\_{[-k]} = \mathcal{D} \setminus \mathcal{L}\_k$ and the test data by $\mathcal{T}\_k = \mathcal{L}\_k$, see Fig. 4.2. Based on the learning data $\mathcal{L}\_{[-k]}$ we calculate the MLE

$$
\widehat{\mu}^{[-k]} \stackrel{\text{def.}}{=} \widehat{\mu}^{\mathsf{MLE}}\_{\mathcal{L}\_{[-k]}},
$$

which is based on all data except *Tk*.

These observations are now used for an (out-of-sample) cross-validation analysis, and averaging over all $1 \le k \le K$ we receive the $K$-*fold cross-validation (CV) loss*

$$\begin{split} \widehat{\mathfrak{D}}^{\text{CV}} &= \frac{1}{K} \sum\_{k=1}^{K} \mathfrak{D}\left(\mathcal{T}\_{k}, \widehat{\mu}^{[-k]}\right) \\ &= \frac{1}{K} \sum\_{k=1}^{K} \frac{1}{|\mathcal{T}\_{k}|} \sum\_{Y\_{i} \in \mathcal{T}\_{k}} \frac{v\_{i}}{\varphi}\, \mathfrak{d}\left(Y\_{i}, \widehat{\mu}^{[-k]}\right) \\ &\approx \frac{1}{n} \sum\_{k=1}^{K} \sum\_{Y\_{i}\in\mathcal{T}\_{k}} \frac{v\_{i}}{\varphi}\, \mathfrak{d}\left(Y\_{i}, \widehat{\mu}^{[-k]}\right). \end{split} \tag{4.35}$$

The last step is an approximation because not all $\mathcal{T}\_k$ may have exactly the same sample size if $n$ is not a multiple of $K$. We can understand (4.35) not only as a conditional out-of-sample loss estimate in the spirit of Definition 4.24; the outer empirical average in (4.35) also makes it suitable as an expected deviance GL estimate according to (4.32). The variance of this empirical deviance GL estimate is given by (subject to existence)

$$\mathrm{Var}\_{\theta}\left(\widehat{\mathfrak{D}}^{\mathrm{CV}}\right) \approx \frac{1}{n^{2}} \sum\_{k,l=1}^{K} \sum\_{Y\_{i} \in \mathcal{T}\_{k}} \sum\_{Y\_{j} \in \mathcal{T}\_{l}} \mathrm{Cov}\_{\theta}\left(\frac{v\_{i}}{\varphi}\,\mathfrak{d}\left(Y\_{i}, \widehat{\mu}^{[-k]}\right), \frac{v\_{j}}{\varphi}\,\mathfrak{d}\left(Y\_{j}, \widehat{\mu}^{[-l]}\right)\right).$$

Typically, in applications, one uses *K*-fold cross-validation with *K* = 10.
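The $K$-fold procedure is a short loop. A sketch for the homogeneous Poisson model with simulated data and $K = 10$ (all names hypothetical); the empirical standard deviation over the folds is the uncertainty measure used in (4.36) below:

```python
import numpy as np

rng = np.random.default_rng(6)


def poisson_unit_deviance(y, mu):
    ylog = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * (mu - y + ylog)


mu_true, n, K = 0.07, 10_000, 10
v = rng.uniform(0.5, 1.5, n)
N = rng.poisson(mu_true * v)
Y = N / v

# Random partition of the index set {0, ..., n-1} into K folds.
folds = np.array_split(rng.permutation(n), K)

cv_terms = []
for idx in folds:
    mask = np.ones(n, dtype=bool)
    mask[idx] = False                          # learning data L_[-k]
    mu_k = N[mask].sum() / v[mask].sum()       # MLE on L_[-k]
    cv_terms.append(np.mean(v[idx] * poisson_unit_deviance(Y[idx], mu_k)))

D_cv = np.mean(cv_terms)                       # K-fold CV loss (4.35)
sd_cv = np.std(cv_terms, ddof=1)               # empirical std over folds
print(D_cv, sd_cv)
```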

#### **Stratified** *K***-Fold Cross-Validation**

A disadvantage of the above $K$-fold cross-validation is that two outliers in the data may, with positive probability, fall into the same subset $\mathcal{L}\_k$. This may substantially distort $K$-fold cross-validation because in that case the subsets $\mathcal{L}\_k$, $1 \le k \le K$, are of different quality. Stratified $K$-fold cross-validation aims at distributing outliers more equally across the partition. Order the observations $Y\_i$, $1 \le i \le n$, as follows

$$Y\_{(1)} \ge Y\_{(2)} \ge \dots \ge Y\_{(n)}.$$

For stratified $K$-fold cross-validation, we randomly distribute (partition) the $K$ biggest claims $Y\_{(1)}, \ldots, Y\_{(K)}$ to the subsets $\mathcal{L}\_k$, $1 \le k \le K$, then we randomly distribute the next $K$ biggest claims $Y\_{(K+1)}, \ldots, Y\_{(2K)}$ to the subsets $\mathcal{L}\_k$, $1 \le k \le K$, and so forth. This implies, e.g., that the two biggest claims cannot fall into the same set $\mathcal{L}\_k$. This stratified partition $\mathcal{L}\_k$, $1 \le k \le K$, is then used for $K$-fold cross-validation.
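The stratified partition described above can be sketched as follows (hypothetical helper, gamma claims chosen for illustration): sort the claims in decreasing order, then deal each consecutive block of $K$ claims out at random, one claim per fold.

```python
import numpy as np

rng = np.random.default_rng(7)


def stratified_folds(y, K, rng):
    """Sort the claims in decreasing order, then deal each consecutive block
    of K claims out at random, one claim per fold, so that, e.g., the two
    biggest claims can never share a fold."""
    order = np.argsort(-y)                 # indices of Y_(1) >= Y_(2) >= ...
    fold = np.empty(len(y), dtype=int)
    for start in range(0, len(y), K):
        block = order[start:start + K]
        fold[block] = rng.permutation(K)[:len(block)]
    return fold


y = rng.gamma(2.0, 10.0, size=1_000)
fold = stratified_folds(y, K=10, rng=rng)

top2 = np.argsort(-y)[:2]
print(fold[top2[0]] != fold[top2[1]])  # True: the two biggest claims are separated
```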

#### **Summary 4.26 (Cross-Validation)**


*Example 4.27 (Out-of-Sample Deviance Loss Estimation)* We consider a claim counts example using the Poisson EDF model. The claim counts $N\_i$ and exposures $v\_i > 0$ used come from the French motor insurance data given in Listing 13.2 of Chap. 13. We model the claim frequencies $Y\_i = N\_i / v\_i$ with the Poisson EDF model having cumulant function $\kappa(\theta) = \exp\{\theta\}$ and dispersion parameter $\varphi = 1$ for all $1 \le i \le n$. The expected frequency is given by $\mu = \mathbb{E}\_\theta[Y\_i] = \kappa'(\theta)$. Moreover, we assume that all claim counts $N\_i$, $1 \le i \le n$, are independent. This provides us with the Poisson deviance loss function for observations $Y\_n = (Y\_1, \ldots, Y\_n)^\top$, see Example 4.12,

$$\begin{aligned} \mathfrak{D}(Y\_n, \mu) &= \frac{1}{n} \sum\_{i=1}^n v\_i\, \mathfrak{d}(Y\_i, \mu) = \frac{1}{n} \sum\_{i=1}^n 2v\_i \left(\mu - Y\_i - Y\_i \log\left(\frac{\mu}{Y\_i}\right)\right) \\ &= \frac{1}{n} \sum\_{i=1}^n 2\left(v\_i \mu - N\_i - N\_i \log\left(\frac{v\_i \mu}{N\_i}\right)\right) \ge 0, \end{aligned}$$

where, for $Y\_i = 0$, we set $\mathfrak{d}(Y\_i = 0, \mu) = 2\mu$. Minimizing the Poisson deviance loss function $\mathfrak{D}(Y\_n, \mu)$ in $\mu$ gives us the MLE for $\mu$ and $\theta = h(\mu)$, respectively. It is given by, see (3.24),

$$\widehat{\mu}^{\text{MLE}} = \widehat{\mu}^{\text{MLE}}\_{\mathcal{L}} = \frac{\sum\_{i=1}^{n} N\_{i}}{\sum\_{i=1}^{n} v\_{i}} = 7.36\%,$$

for the learning data set $\mathcal{L} = \{Y\_1, \ldots, Y\_n\}$. This provides us with an in-sample Poisson deviance loss of $\mathfrak{D}(Y\_n, \widehat{\mu}\_{\mathcal{L}}^{\rm MLE}) = \mathfrak{D}(\mathcal{L}, \widehat{\mu}\_{\mathcal{L}}^{\rm MLE}) = 25.213 \cdot 10^{-2}$.

Since we do not have test data $\mathcal{T}$, we explore tenfold cross-validation. We therefore partition the entire data at random into $K = 10$ disjoint sets $\mathcal{L}\_1, \ldots, \mathcal{L}\_{10}$, and compute the tenfold cross-validation loss as described in (4.35). This gives us $\widehat{\mathfrak{D}}^{\rm CV} = 25.213 \cdot 10^{-2}$; thus, we receive the same value as for the in-sample loss, which says that we do not have in-sample over-fitting, here. This is not surprising in the homogeneous model $\mu = \mathbb{E}\_\theta[Y\_i]$. We can also quantify the uncertainty in this estimate by the corresponding empirical standard deviation over the folds $\mathcal{T}\_k = \mathcal{L}\_k$

$$\sqrt{\frac{1}{K-1}\sum\_{k=1}^{K}\left(\mathfrak{D}\left(\mathcal{T}\_{k},\widehat{\mu}^{[-k]}\right)-\widehat{\mathfrak{D}}^{\rm CV}\right)^{2}}=0.234\cdot10^{-2}.\tag{4.36}$$

This says that there is quite some fluctuation in the data, because the uncertainty in the estimate $\widehat{\mathfrak{D}}^{\rm CV} = 25.213 \cdot 10^{-2}$ is roughly 1%. This finishes this example, and we will come back to it in Sect. 5.2.4, below. $\blacksquare$

## *4.2.3 Akaike's Information Criterion*

The out-of-sample analysis in terms of GLs and cross-validation evaluates the predictive performance on unseen data. Another way of model selection is to study in-sample losses instead, but to penalize model complexity. Akaike's information criterion (AIC), see Akaike [5], is the most popular tool that follows such a model selection methodology. AIC is based on a set of assumptions which should be fulfilled for it to apply; this is discussed in this section, where we follow the lecture notes of Künsch [229].

Assume we have independent random variables $Y\_i$ from some (unknown) density $f$. Assume we have two candidate models with densities $h\_\theta$ and $g\_\vartheta$ from which we would like to select the preferred one for the given data $Y\_n = (Y\_1, \ldots, Y\_n)^\top$. The two unknown parameters in these densities are called $\theta$ and $\vartheta$, respectively. We neither assume that one of the two models $h\_\theta$ and $g\_\vartheta$ contains the true model $f$, nor that the two models are nested. That is, $f$, $h\_\theta$ and $g\_\vartheta$ are quite general densities w.r.t. a given $\sigma$-finite measure $\nu$.

Assume that both models under consideration have unique MLEs $\widehat{\theta}^{\rm MLE} = \widehat{\theta}^{\rm MLE}(Y\_n)$ and $\widehat{\vartheta}^{\rm MLE} = \widehat{\vartheta}^{\rm MLE}(Y\_n)$, both based on the same observations $Y\_n$. AIC [5] says that model $h\_{\widehat{\theta}^{\rm MLE}}$ should be preferred over model $g\_{\widehat{\vartheta}^{\rm MLE}}$ if

$$-2\sum\_{i=1}^{n} \log\left(h\_{\widehat{\theta}^{\text{MLE}}}(Y\_{i})\right) + 2\dim(\theta) \quad < \quad -2\sum\_{i=1}^{n} \log\left(g\_{\widehat{\vartheta}^{\text{MLE}}}(Y\_{i})\right) + 2\dim(\vartheta),\tag{4.37}$$

where $\dim(\cdot)$ denotes the dimension of the corresponding parameter. Thus, we compute the log-likelihoods of the data $Y\_n$ in the corresponding MLEs $\widehat{\theta}^{\rm MLE}$ and $\widehat{\vartheta}^{\rm MLE}$, and we penalize the resulting values with the number of parameters to correct for model complexity. We give some remarks.
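A minimal sketch of (4.37): simulated gamma data, an exponential model $h$ versus a gamma model $g$, fitted with scipy (all names hypothetical, not taken from the text).

```python
import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Data from a gamma distribution; two candidate models:
# (h) exponential with dim(theta) = 1, (g) gamma with dim(vartheta) = 2.
n = 2_000
y = rng.gamma(shape=3.0, scale=2.0, size=n)

# Exponential log-likelihood at its MLE (rate = 1 / sample mean).
rate = 1.0 / y.mean()
ll_exp = n * math.log(rate) - rate * y.sum()

# Gamma log-likelihood at the (numerically fitted) MLE, location fixed at 0.
a_hat, _, scale_hat = stats.gamma.fit(y, floc=0.0)
ll_gam = stats.gamma.logpdf(y, a_hat, loc=0.0, scale=scale_hat).sum()

aic_exp = -2.0 * ll_exp + 2 * 1   # (4.37) with dim(theta) = 1
aic_gam = -2.0 * ll_gam + 2 * 2   # (4.37) with dim(vartheta) = 2
print(aic_exp, aic_gam)  # the gamma model has the smaller AIC here
```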

#### *Remarks 4.28*


$$\sum\_{i=1}^{n} \log \left( h\_{\widehat{\theta}^{\mathsf{MLE}}}(Y\_i) \right) = -\frac{n}{2} \log(2\pi\sigma^2) - \sum\_{i=1}^{n} \frac{1}{2\sigma^2} \left( Y\_i - \widehat{\theta}^{\mathsf{MLE}} \right)^2.$$

On the transformed scale we have the MLE $\widehat{\vartheta}^{\rm MLE} = \sum\_{i=1}^n cY\_i/n = c\,\widehat{\theta}^{\rm MLE}$ and log-likelihood in the MLE $\widehat{\vartheta}^{\rm MLE}$

$$\sum\_{i=1}^{n} \log \left( g\_{\widehat{\vartheta}^{\mathsf{MLE}}}(cY\_{i}) \right) = -\frac{n}{2} \log(2\pi c^{2}\sigma^{2}) - \sum\_{i=1}^{n} \frac{1}{2c^{2}\sigma^{2}} \left( cY\_{i} - c\widehat{\theta}^{\mathsf{MLE}} \right)^{2}.$$

Thus, we find that the two log-likelihoods differ by $-n\log(c)$, although we consider the same model, only under different measurement units of the data. The same applies when we work, e.g., with a log-normal model or logged data in a Gaussian model.
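This $-n\log(c)$ shift is easy to verify numerically; a sketch with known $\sigma$ (hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(9)

# Gaussian model with known sigma: log-likelihood at the MLE, once for the
# original data Y and once for the rescaled data c * Y.
n, sigma, c = 500, 2.0, 100.0
y = rng.normal(5.0, sigma, size=n)


def gauss_loglik(x, mu, sd):
    return np.sum(-0.5 * np.log(2 * np.pi * sd ** 2)
                  - (x - mu) ** 2 / (2 * sd ** 2))


ll_orig = gauss_loglik(y, y.mean(), sigma)
ll_scaled = gauss_loglik(c * y, c * y.mean(), c * sigma)
print(ll_scaled - ll_orig)  # equals -n log(c) exactly
```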

We give a heuristic justification of AIC. In Example 3.10 we have seen that the MLE is obtained by minimizing the KL divergence from $h\_\theta$ to the empirical distribution $\widehat{f}\_n$ of $Y\_n$. This motivates using the KL divergence also for comparing the MLE estimated models to the true model, i.e., we consider the difference (provided the densities are defined on the same domain)

$$\begin{split} &D\_{\mathrm{KL}}\left(f \left\| \, h\_{\widehat{\theta}^{\mathrm{MLE}}}\right.\right) - D\_{\mathrm{KL}}\left(f \left\| \, g\_{\widehat{\vartheta}^{\mathrm{MLE}}}\right.\right) \\ &= \int \log\left(\frac{f(y)}{h\_{\widehat{\theta}^{\mathrm{MLE}}}(y)}\right) f(y)\,d\nu(y) - \int \log\left(\frac{f(y)}{g\_{\widehat{\vartheta}^{\mathrm{MLE}}}(y)}\right) f(y)\,d\nu(y) \\ &= \int \log\left(g\_{\widehat{\vartheta}^{\mathrm{MLE}}}(y)\right) f(y)\,d\nu(y) - \int \log\left(h\_{\widehat{\theta}^{\mathrm{MLE}}}(y)\right) f(y)\,d\nu(y). \end{split} \tag{4.38}$$

If this difference is negative, model $h\_{\widehat{\theta}^{\mathrm{MLE}}}$ should be preferred over model $g\_{\widehat{\vartheta}^{\mathrm{MLE}}}$ because it is closer to the true model $f$ w.r.t. the KL divergence. Thus, we need to calculate the two integrals in (4.38). Since the true density $f$ is not known, these two integrals need to be estimated.

As a first idea we estimate the integrals on the right-hand side empirically using the observations $\boldsymbol{Y}\_n$; say, the first integral is estimated by

$$\frac{1}{n}\sum\_{i=1}^n \log\left(g\_{\widehat{\vartheta}^{\mathrm{MLE}}}(Y\_i)\right).$$

However, this leads to a biased estimate because the MLE $\widehat{\vartheta}^{\mathrm{MLE}}$ exactly maximizes this empirical estimate (as a function of $\vartheta$). The integrals in (4.38), on the other hand, can be interpreted as an out-of-sample calculation between the independent random variables $\boldsymbol{Y}\_n$ (used for the MLE) and $Y \sim f d\nu$ used in the integral. The bias results from the fact that in the empirical estimate this independence gets lost. Therefore, we need to correct the estimate for the bias in order to obtain a reasonable estimate for the difference of the KL divergences. Under the following assumptions this bias correction is asymptotically given by $-\dim(\vartheta)/n$: (1) $\sqrt{n}\,\big(\widehat{\vartheta}^{\mathrm{MLE}}(\boldsymbol{Y}\_n) - \vartheta\_0\big)$ is asymptotically normally distributed $\mathcal{N}\big(0, \Lambda(\vartheta\_0)^{-1}\big)$ as $n \to \infty$, where $\vartheta\_0$ is the parameter that minimizes the KL divergence from $g\_\vartheta$ to $f$; we also refer to Remarks 3.26. (2) The true $f$ is sufficiently close to $g\_{\vartheta\_0}$ such that the $\mathbb{E}\_f$-covariance matrix of the score $\nabla\_\vartheta \log g\_{\vartheta\_0}$ is close to the negative $\mathbb{E}\_f$-expected Hessian $\nabla^2\_\vartheta \log g\_{\vartheta\_0}$; see also (3.36) and Sect. 11.1.4, below. In that case, $\Lambda(\vartheta\_0)$ approximately corresponds to Fisher's information matrix $\mathcal{I}\_1(\vartheta\_0)$ and AIC is justified.
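The in-sample optimism of roughly $\dim(\vartheta)/n$ can be checked by a small simulation. The following sketch is our own illustration, not part of the text; the Gaussian model choice, the sample size and the number of runs are all hypothetical. It compares the in-sample average log-likelihood with an evaluation on an independent copy of the data; the average gap is close to $\dim(\vartheta)/n$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 4000  # hypothetical sample size and number of simulation runs
k = 2                # dim(vartheta): Gaussian mean and variance

def avg_loglik(y, mu, sigma2):
    """Average Gaussian log-likelihood of y under N(mu, sigma2)."""
    return float(np.mean(-0.5 * np.log(2 * np.pi * sigma2)
                         - (y - mu) ** 2 / (2 * sigma2)))

gaps = []
for _ in range(reps):
    y = rng.standard_normal(n)        # data used for the MLE
    y_new = rng.standard_normal(n)    # independent out-of-sample data
    mu_hat, s2_hat = y.mean(), y.var()  # Gaussian MLEs
    # the in-sample evaluation is optimistic, the out-of-sample one is not
    gaps.append(avg_loglik(y, mu_hat, s2_hat)
                - avg_loglik(y_new, mu_hat, s2_hat))

mean_gap = float(np.mean(gaps))  # close to dim(vartheta)/n = 0.02
```

The simulated mean gap is close to $k/n = 0.02$, which is exactly the per-observation penalty that AIC charges.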

This shows that AIC applies if both models are evaluated on the same observations $\boldsymbol{Y}\_n$, the models use the MLEs, and asymptotic normality holds with limits such that the true model is close to a member of the selected model classes $\{h\_\theta;\,\theta\}$ and $\{g\_\vartheta;\,\vartheta\}$. We remark that this is not the only set-up under which AIC can be justified, but other set-ups do not differ essentially.

The Bayesian information criterion (BIC) is similar to AIC but set in a Bayesian context. The BIC says that model $h\_{\widehat{\theta}^{\mathrm{MLE}}}$ should be preferred over model $g\_{\widehat{\vartheta}^{\mathrm{MLE}}}$ if

$$-2\sum\_{i=1}^{n} \log\left(h\_{\widehat{\theta}^{\mathrm{MLE}}}(Y\_i)\right) + \log(n)\dim(\theta) \; < \; -2\sum\_{i=1}^{n} \log\left(g\_{\widehat{\vartheta}^{\mathrm{MLE}}}(Y\_i)\right) + \log(n)\dim(\vartheta),$$

where *n* is the sample size of *Y<sup>n</sup>* used for model fitting. The BIC has been derived by Schwarz [331]. Therefore, it is also called Schwarz' information criterion (SIC).
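As an illustration (our own sketch, not from the text; the data-generating choice is hypothetical), the following compares a Gaussian and a log-normal model via AIC and BIC on the same observations. Note the Jacobian term $-\sum\_i \log Y\_i$ that makes the logged-data model comparable on the original scale, in line with Remarks 4.28.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.5, size=200)  # hypothetical claim sizes
n = len(y)

def gaussian_loglik(y):
    """Gaussian log-likelihood evaluated at the MLEs (mean, variance)."""
    s2 = y.var()
    return -0.5 * n * np.log(2 * np.pi * s2) - 0.5 * n

def lognormal_loglik(y):
    """Gaussian fit on log(y) plus the Jacobian term -sum(log y),
    so that both models are evaluated on the same observations y."""
    z = np.log(y)
    s2 = z.var()
    return -0.5 * n * np.log(2 * np.pi * s2) - 0.5 * n - z.sum()

k = 2  # both models have two parameters
ll_g, ll_ln = gaussian_loglik(y), lognormal_loglik(y)
aic_g, aic_ln = -2 * ll_g + 2 * k, -2 * ll_ln + 2 * k
bic_g, bic_ln = -2 * ll_g + np.log(n) * k, -2 * ll_ln + np.log(n) * k
# the model with the smaller AIC (resp. BIC) is preferred
```

On this simulated log-normal data the log-normal model attains the smaller AIC and BIC, as expected.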

## **4.3 Bootstrap**

The bootstrap method was introduced by Efron [115] and Efron–Tibshirani [118]. The bootstrap is used to simulate new data from either the empirical distribution $\widehat{F}\_n$ or from an estimated model $F(\cdot;\widehat{\theta})$. This allows us, for instance, to evaluate the outer expectation in the expected deviance GL (4.32), which requires a data model for $\boldsymbol{Y}\_n$. The presentation in this section is based on the lecture notes of Bühlmann–Mächler [59, Chapter 5].

## *4.3.1 Non-parametric Bootstrap Simulation*

Assume we have i.i.d. observations $Y\_1,\ldots,Y\_n$ from an unknown distribution function $F(\cdot;\theta)$. Based on these observations $\boldsymbol{Y} = (Y\_1,\ldots,Y\_n)^\top$ we choose a decision rule $A: \mathcal{Y} \to \Theta \subseteq \mathbb{R}$ which provides us with an estimator for $\theta$

$$\boldsymbol{Y} \mapsto \widehat{\theta} = A(\boldsymbol{Y}). \tag{4.39}$$

Typically, the decision rule $A(\cdot)$ is a known function and we would like to determine the distributional properties of the parameter estimator (4.39) as a function of the (random) observations $\boldsymbol{Y}$. E.g., for any measurable set $C$, we might want to compute

$$\mathbb{P}\_{\theta}\left[\widehat{\theta}\in C\right] = \mathbb{P}\_{\theta}\left[A(\mathbf{Y})\in C\right] = \int \mathbb{1}\_{\{A(\mathbf{y})\in C\}} dP(\mathbf{y};\theta). \tag{4.40}$$

Since, typically, the true data generating distribution $Y\_i \sim F(\cdot;\theta)$ is not known, the distributional properties of $\widehat{\theta}$ cannot be determined, not even by Monte Carlo simulation. The idea behind the bootstrap is to approximate $F(\cdot;\theta)$. Choose as approximation to $F(\cdot;\theta)$ the empirical distribution of the i.i.d. observations $\boldsymbol{Y}$ given by, see (3.9),

$$\widehat{F}\_n(y) = \frac{1}{n} \sum\_{i=1}^n \mathbb{1}\_{\{Y\_i \le y\}} \qquad \text{for } y \in \mathbb{R}.$$

The Glivenko–Cantelli theorem [64, 159] tells us that the empirical distribution $\widehat{F}\_n$ converges uniformly to $F(\cdot;\theta)$, a.s., for $n \to \infty$, so it should be a good approximation to $F(\cdot;\theta)$ for large $n$. The idea now is to simulate from the empirical distribution $\widehat{F}\_n$.

(Non-parametric) bootstrap algorithm

(1) Repeat for $m = 1,\ldots,M$:
	- (a) simulate i.i.d. observations $Y^\*\_1,\ldots,Y^\*\_n$ from $\widehat{F}\_n$ (these are obtained by random drawings with replacement from the observations $Y\_1,\ldots,Y\_n$; we denote this resampling distribution of $\boldsymbol{Y}^\* = (Y^\*\_1,\ldots,Y^\*\_n)$ by $\mathbb{P}^\* = \mathbb{P}^\*\_{\boldsymbol{Y}}$);
	- (b) calculate the estimator $\widehat{\theta}^{(m\*)} = A(\boldsymbol{Y}^\*)$.

(2) Return $\widehat{\theta}^{(1\*)},\ldots,\widehat{\theta}^{(M\*)}$ and the resulting empirical bootstrap distribution

$$\widehat{F}\_M^\*(\vartheta) = \frac{1}{M} \sum\_{m=1}^M \mathbb{1}\_{\{\widehat{\theta}^{(m\*)} \le \vartheta\}},$$

for the estimated distribution of $\widehat{\theta}$.

We can use the *empirical bootstrap distribution* $\widehat{F}^\*\_M$ as an estimate of the true distribution of $\widehat{\theta}$, that is, we estimate and approximate

$$\mathbb{P}\_{\theta} \left[ \widehat{\theta} \in C \right] \approx \widehat{\mathbb{P}}\_{\theta} \left[ \widehat{\theta} \in C \right] \stackrel{\text{def.}}{=} \mathbb{P}\_{\boldsymbol{Y}}^{\*} \left[ \widehat{\theta}^{\*} \in C \right] \approx \frac{1}{M} \sum\_{m=1}^{M} \mathbb{1}\_{\{\widehat{\theta}^{(m\*)} \in C\}},\qquad(4.41)$$

where $\mathbb{P}^\*\_{\boldsymbol{Y}}$ corresponds to the *bootstrap distribution* of Step (1a) of the above algorithm, and where we set $\widehat{\theta}^\* = A(\boldsymbol{Y}^\*)$. This bootstrap distribution $\mathbb{P}^\*\_{\boldsymbol{Y}}$ is empirically approximated by the empirical bootstrap distribution $\widehat{F}^\*\_M$ for studying $\widehat{\theta}^\*$.
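The algorithm above can be sketched in a few lines (our own illustration; the gamma data and the sample-mean decision rule $A$ are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.gamma(shape=2.0, scale=1.5, size=100)  # hypothetical observations
n, M = len(Y), 2000

def A(sample):
    """Decision rule: here simply the sample mean."""
    return sample.mean()

theta_hat = A(Y)

# Step (1): draw with replacement from Y (i.e., from F_n-hat) and refit
theta_star = np.array([A(rng.choice(Y, size=n, replace=True))
                       for _ in range(M)])

# Step (2): the empirical bootstrap distribution of theta_hat yields
# estimates of its mean, variance and quantiles, cf. (4.41)
boot_mean = theta_star.mean()
boot_var = theta_star.var(ddof=1)
ci = np.quantile(theta_star, [0.025, 0.975])  # percentile interval
```

The array `theta_star` is a Monte Carlo sample from $\mathbb{P}^\*\_{\boldsymbol{Y}}$, so any distributional quantity of the estimator can be read off from it.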

#### *Remarks 4.29*

• The quality of the approximations in (4.41) depends on the richness of the observation $\boldsymbol{Y} = (Y\_1,\ldots,Y\_n)$, because the bootstrap distribution

$$\mathbb{P}\_{\boldsymbol{Y}}^\* \left[ \widehat{\theta}^\* \in C \right] = \mathbb{P}\_{\boldsymbol{Y}=\boldsymbol{y}}^\* \left[ \widehat{\theta}^\* \in C \right],$$

depends on the realization $\boldsymbol{y}$ of the data $\boldsymbol{Y}$ from which we generate the bootstrap sample $\boldsymbol{Y}^\*$. It also depends on $M$ and the explicit random drawings $Y^\*\_i$ providing the empirical bootstrap distribution $\widehat{F}^\*\_M$. The latter uncertainty can be controlled since the bootstrap distribution $\mathbb{P}^\*\_{\boldsymbol{Y}}$ corresponds to a multinomial distribution, and the Glivenko–Cantelli theorem [64, 159] applies to $\widehat{F}^\*\_M$ and $\mathbb{P}^\*\_{\boldsymbol{Y}}$ for $M \to \infty$. The former uncertainty inherited from the realization $\boldsymbol{Y} = \boldsymbol{y}$ cannot be diminished because we cannot enrich the observation $\boldsymbol{Y}$.

• The empirical bootstrap distribution $\widehat{F}^\*\_M$ can be used to estimate the mean of the estimator $\widehat{\theta}$ given in (4.39)

$$\widehat{\mathbb{E}}\_{\theta} \left[ \widehat{\theta} \right] = \mathbb{E}\_{\boldsymbol{Y}}^{\*} \left[ \widehat{\theta}^{\*} \right] \approx \frac{1}{M} \sum\_{m=1}^{M} \widehat{\theta}^{(m\*)},$$

and its variance

$$\widehat{\text{Var}}\_{\theta} \left( \widehat{\theta} \right) = \text{Var}\_{\mathbb{P}\_{Y}^{\*}} \left( \widehat{\theta}^{\*} \right) \\ \approx \frac{1}{M - 1} \sum\_{m = 1}^{M} \left( \widehat{\theta}^{(m \ast )} - \frac{1}{M} \sum\_{k = 1}^{M} \widehat{\theta}^{(k \ast )} \right)^{2} .$$


• In regression settings one often uses the *residual bootstrap*: given fitted means $\widehat{\mu}\_i$ and scales $\widehat{\sigma}\_i$, one resamples the standardized residuals $\widehat{\varepsilon}\_i = (Y\_i - \widehat{\mu}\_i)/\widehat{\sigma}\_i$ to obtain $\widehat{\varepsilon}^\*\_i$, and sets

$$Y\_i^\* = \widehat{\mu}\_i + \widehat{\sigma}\_i \widehat{\varepsilon}\_i^\*.$$

The *wild bootstrap* proposed by Wu [386] additionally uses centered and normalized i.i.d. random variables $V\_i$ (also being independent of $\widehat{\varepsilon}^\*\_i$) to modify the residual bootstrap observations to

$$Y\_{i}^{\*} = \widehat{\mu}\_{i} + \widehat{\sigma}\_{i} V\_{i} \widehat{\varepsilon}\_{i}^{\*}.$$
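The residual and wild bootstrap can be sketched as follows (our own illustration; the fitted means $\widehat{\mu}\_i$ and scales $\widehat{\sigma}\_i$ are taken as given, and Rademacher variables are one common choice for the $V\_i$):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# hypothetical fitted values from some heteroskedastic regression model
mu_hat = np.linspace(1.0, 5.0, n)
sigma_hat = 0.1 * mu_hat
Y = mu_hat + sigma_hat * rng.standard_normal(n)

# standardized residuals
eps_hat = (Y - mu_hat) / sigma_hat

# residual bootstrap: resample the residuals with replacement
eps_star = rng.choice(eps_hat, size=n, replace=True)
Y_res = mu_hat + sigma_hat * eps_star

# wild bootstrap: multiply by independent centered, normalized V_i,
# here Rademacher variables V_i in {-1, +1}
V = rng.choice([-1.0, 1.0], size=n)
Y_wild = mu_hat + sigma_hat * V * eps_star
```

The wild bootstrap keeps the observation-wise scale $\widehat{\sigma}\_i$ intact, which is why it is popular for heteroskedastic data.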


The bootstrap is called *consistent* for $\widehat{\theta}$ if we have for all $z \in \mathbb{R}$ the following convergence in probability as $n \to \infty$

$$\mathbb{P}\_{\theta}\left[\sqrt{n}\left(\widehat{\theta}-\theta\right)\leq z\right]-\mathbb{P}\_{Y}^{\*}\left[\sqrt{n}\left(\widehat{\theta}^{\*}-\widehat{\theta}\right)\leq z\right] \stackrel{\text{prob.}}{\rightarrow}0,$$

The quantities $\widehat{\theta} = \widehat{\theta}\_n$ and $\widehat{\theta}^\* = \widehat{\theta}^\*\_n$ depend on (the size $n$ of) the observation $\boldsymbol{Y} = \boldsymbol{Y}\_n$; the convergence in probability is needed because $\boldsymbol{Y} = \boldsymbol{Y}\_n$ are random vectors. Assume that $\widehat{\theta}^{\mathrm{MLE}} = \widehat{\theta}$ is the MLE of $\theta$ satisfying the assumptions of Theorem 3.28. Then we have asymptotic normality, see (3.30),

$$\sqrt{n}\left(\widehat{\theta} - \theta\right) \Rightarrow \mathcal{N}\left(0, \mathcal{I}\_1(\theta)^{-1}\right) \qquad \text{as } n \to \infty,$$

with Fisher's information $\mathcal{I}\_1(\theta)$. Bootstrap consistency then requires

$$\sqrt{n}\left(\widehat{\theta}^{\*} - \widehat{\theta}\right) \stackrel{\mathbb{P}^{\*}\_{\boldsymbol{Y}}}{\Longrightarrow} \mathcal{N}\left(0, \mathcal{I}\_1(\theta)^{-1}\right) \qquad \text{in probability as } n \to \infty.$$

Bootstrap consistency typically holds if $\widehat{\theta}$ is asymptotically normal (as $n \to \infty$) and if the underlying data $Y\_i$ are i.i.d. Moreover, bootstrap consistency usually implies consistent variance and bias estimation

$$\frac{\mathrm{Var}\_{\mathbb{P}\_{\boldsymbol{Y}}^{\*}}\left(\widehat{\theta}^{\*}\right)}{\mathrm{Var}\_{\theta}\left(\widehat{\theta}\right)} \stackrel{\text{prob.}}{\longrightarrow} 1 \qquad \text{and} \qquad \frac{\mathbb{E}\_{\boldsymbol{Y}}^{\*}\left[\widehat{\theta}^{\*}\right] - \widehat{\theta}}{\mathbb{E}\_{\theta}\left[\widehat{\theta}\right] - \theta} \stackrel{\text{prob.}}{\longrightarrow} 1 \qquad \text{as } n \to \infty.$$

For more information and bootstrap confidence intervals we refer to Chapter 5 in the lecture notes of Bühlmann–Mächler [59].

## *4.3.2 Parametric Bootstrap Simulation*

For the parametric bootstrap we assume to know the parametric family $\mathcal{F} = \{F(\cdot;\theta);\, \theta \in \Theta\}$ from which the i.i.d. observations $Y\_1,\ldots,Y\_n \sim F(\cdot;\theta)$ have been generated; only the explicit choice of the parameter $\theta \in \Theta$ is not known. Based on these observations we construct an estimator $\widehat{\theta} = A(\boldsymbol{Y})$ for the unknown parameter $\theta \in \Theta$.

(Parametric) bootstrap algorithm

(1) Repeat for $m = 1,\ldots,M$:
	- (a) simulate i.i.d. observations $Y^\*\_1,\ldots,Y^\*\_n$ from the estimated distribution $F(\cdot;\widehat{\theta})$;
	- (b) calculate the estimator $\widehat{\theta}^{(m\*)} = A(\boldsymbol{Y}^\*)$.

(2) Return $\widehat{\theta}^{(1\*)},\ldots,\widehat{\theta}^{(M\*)}$ and the resulting empirical bootstrap distribution

$$\widehat{F}\_M^\*(\vartheta) = \frac{1}{M} \sum\_{m=1}^M \mathbb{1}\_{\{\widehat{\theta}^{(m\*)} \le \vartheta\}}.$$

We then estimate and approximate the distribution of $\widehat{\theta}$ analogously to (4.41), and the same remarks apply as for the non-parametric bootstrap. The parametric bootstrap has the advantage that it can enrich the data by sampling new observations from the distribution $F(\cdot;\widehat{\theta})$. A shortcoming of the parametric bootstrap occurs if the family $\mathcal{F}$ is misspecified: then the bootstrap sample $\boldsymbol{Y}^\*$ will only poorly describe the true data $\boldsymbol{Y}$, e.g., if the data shows over-dispersion but the selected family $\mathcal{F}$ does not allow one to model such over-dispersion.
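A minimal parametric bootstrap sketch (our own illustration; the exponential family, for which the MLE of the mean is the sample mean, is a hypothetical choice):

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.exponential(scale=2.0, size=80)  # hypothetical observations
n, M = len(Y), 2000
theta_hat = Y.mean()  # MLE of the mean of the exponential distribution

# Step (1): simulate new samples from F(.; theta_hat) and refit
theta_star = np.array([rng.exponential(scale=theta_hat, size=n).mean()
                       for _ in range(M)])

# Step (2): the empirical bootstrap distribution estimates the
# sampling variability of theta_hat ...
boot_sd = theta_star.std(ddof=1)
# ... which here can be compared to the standard error theta_hat/sqrt(n)
se = theta_hat / np.sqrt(n)
```

In this toy example the bootstrap standard deviation reproduces the known standard error $\widehat{\theta}/\sqrt{n}$, illustrating the consistency statements of the previous subsection.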

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 5 Generalized Linear Models**

Most of the theory in the previous chapters has been based on the assumption of similarity (or homogeneity) between the different observations. This was expressed by making an i.i.d. assumption on the observations, see, e.g., Sect. 3.3.2. In many practical applications such a homogeneity assumption is not reasonable; one may, for example, think of car insurance pricing, where different car drivers have different driving experience and drive different cars, or of health insurance, where policyholders may have different genders and ages. Figure 5.1 shows a health insurance example where the claim sizes depend on the gender and the age of the policyholders. The most popular statistical models that are able to cope with such heterogeneous data are the *generalized linear models* (GLMs). The notion of GLMs has been introduced in the seminal work of Nelder–Wedderburn [283] in 1972. Their work introduced a unified procedure for modeling and fitting distributions within the EDF to data having systematic differences (effects) that can be described by explanatory variables. Today, GLMs are the state-of-the-art statistical models in many applied fields including statistics, actuarial science and economics. However, the specific use of GLMs in the different fields may differ substantially. In fields like actuarial science these models are mainly used for *predictive* modeling; in other fields like economics or social sciences GLMs have become the main tool for exploring and *explaining* (hopefully) causal relations. For a discussion of "predicting" versus "explaining" we refer to Shmueli [338].

It is difficult to give a complete list of references for GLMs, since GLMs and their offspring are present in almost every statistical modeling publication and in every lecture on statistics. Classical statistical references are the books of McCullagh–Nelder [265], Fahrmeir–Tutz [123] and Dobson [107]; in the actuarial literature we mention the textbooks (in alphabetical order) of Charpentier [67], De Jong–Heller [89], Denuit et al. [99–101], Frees [134] and Ohlsson–Johansson [290], but this list is far from complete.

In this chapter we introduce and discuss GLMs in the context of actuarial modeling. We do this in such a way that GLMs can be seen as a building block of network regression models which will be the main topic of Chap. 7 on deep learning.

## **5.1 Generalized Linear Models and Log-Likelihoods**

## *5.1.1 Regression Modeling*

We start by assuming that we have independent random variables $Y\_1,\ldots,Y\_n$ which are described by a fixed member of the EDF. That is, we assume that all $Y\_i$ are independent and have densities w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$ given by

$$Y\_{i} \sim f(y\_{i}; \theta\_{i}, v\_{i}/\varphi) = \exp\left\{\frac{y\_{i}\theta\_{i} - \kappa(\theta\_{i})}{\varphi/v\_{i}} + a(y\_{i}; v\_{i}/\varphi)\right\} \qquad \text{for } 1 \le i \le n,\tag{5.1}$$

with canonical parameters $\theta\_i \in \mathring{\Theta}$, exposures $v\_i > 0$ and dispersion parameter $\varphi > 0$. Throughout, we assume that the effective domain $\Theta$ has a non-empty interior. There is a fundamental difference between (5.1) and Example 3.5. We now allow every random variable $Y\_i$ to have its own canonical parameter $\theta\_i \in \mathring{\Theta}$. We call this a *heterogeneous* situation because the observations are allowed to differ in a systematic way, expressed by different canonical parameters. This is highlighted by the lines in the health insurance example of Fig. 5.1, where (expected) claim sizes differ by gender and age of the policyholder.

In Sect. 4.1.2 we have introduced the *saturated model*, where every observation $Y\_i$ has its own parameter $\theta\_i$. In general, if we have $n$ observations $\boldsymbol{Y} = (Y\_1,\ldots,Y\_n)^\top$ we can estimate at most $n$ parameters. The other extreme case is the homogeneous one, meaning that $\theta\_i = \theta \in \mathring{\Theta}$ for all $1 \le i \le n$. In this latter case we have exactly one parameter to estimate, and we call this model the *null model*, *intercept model* or *homogeneous model*, because all components of $\boldsymbol{Y}$ are assumed to follow the same law, expressed by a single common parameter $\theta$. Both the saturated model and the null model may behave very poorly in predicting new observations. Typically, the saturated model fully reflects the data $\boldsymbol{Y}$ including the noisy part (random component, irreducible risk, see Remarks 4.2) and, therefore, it is not useful for prediction. We also say that this model (in-sample) over-fits to the data $\boldsymbol{Y}$ and does not generalize (out-of-sample) to new data. The null model often has a poor predictive performance because, if the data has systematic effects, these cannot be captured by a null model. GLMs try to find a good balance between these two extreme cases by trying to extract (only) the systematic effects from the noisy data $\boldsymbol{Y}$. We therefore model the canonical parameters $\theta\_i$ as a low-dimensional function of *explanatory variables* which capture the systematic effects in the data. In Fig. 5.1 gender and age of the policyholder play the role of such explanatory variables.

Assume that each observation $Y\_i$ is equipped with a *feature* (explanatory variable, covariate) $\boldsymbol{x}\_i$ that belongs to a fixed given *feature space* $\mathcal{X}$. These features $\boldsymbol{x}\_i$ are assumed to describe the *systematic effects* in the observations $Y\_i$, i.e., they are assumed to be appropriate descriptions of the heterogeneity between the observations. In a nutshell, we then assume that there is a suitable *regression function*

$$\theta: \mathcal{X} \to \mathring{\Theta}, \qquad \boldsymbol{x} \mapsto \theta(\boldsymbol{x}),$$

such that we can appropriately describe the observations by

$$Y\_{i} \stackrel{\text{ind.}}{\sim} f(y\_{i}; \theta\_{i} = \theta(\boldsymbol{x}\_{i}), v\_{i}/\varphi) = \exp\left\{ \frac{y\_{i}\theta(\boldsymbol{x}\_{i}) - \kappa(\theta(\boldsymbol{x}\_{i}))}{\varphi/v\_{i}} + a(y\_{i}; v\_{i}/\varphi) \right\},\tag{5.2}$$

for 1 ≤ *i* ≤ *n*. As a result we receive for the first moment of *Yi*, see Corollary 2.14,

$$\mu\_i = \mu(\boldsymbol{x}\_i) = \mathbb{E}\_{\theta(\boldsymbol{x}\_i)} \left[ Y\_i \right] = \kappa'(\theta(\boldsymbol{x}\_i)).\tag{5.3}$$

Thus, the regression function $\theta: \mathcal{X} \to \mathring{\Theta}$ is assumed to describe the systematic differences (effects) between the random variables $Y\_1,\ldots,Y\_n$, being expressed by the means $\mu(\boldsymbol{x}\_i)$ for features $\boldsymbol{x}\_1,\ldots,\boldsymbol{x}\_n$. In GLMs this regression function takes a linear form after a suitable transformation, which exactly motivates the terminology *generalized linear model*.

## *5.1.2 Definition of Generalized Linear Models*

We start with the discussion of the features $\boldsymbol{x} \in \mathcal{X}$. Features are also called explanatory variables, covariates, independent variables or regressors. Throughout, we assume that the features $\boldsymbol{x} = (x\_0, x\_1,\ldots,x\_q)^\top$ include a first component $x\_0 = 1$, and we choose feature space $\mathcal{X} \subset \{1\} \times \mathbb{R}^q$. The inclusion of this first component $x\_0 = 1$ is useful in what follows. We call this first component *intercept* or *bias component* because it will be modeling an intercept of a regression model. The null model (homogeneous model) has features that only consist of this intercept component. For later purposes it will be useful to introduce the *design matrix* $\mathfrak{X}$ which collects the features $\boldsymbol{x}\_1,\ldots,\boldsymbol{x}\_n \in \mathcal{X}$ of all responses $Y\_1,\ldots,Y\_n$. The design matrix is defined by

$$\mathfrak{X} = (\boldsymbol{x}\_1, \ldots, \boldsymbol{x}\_n)^\top = \begin{pmatrix} 1 & x\_{1,1} & \cdots & x\_{1,q} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x\_{n,1} & \cdots & x\_{n,q} \end{pmatrix} \in \mathbb{R}^{n \times (q+1)}.\tag{5.4}$$

Based on these choices we assume the existence of a *regression parameter* $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$ and of a strictly monotone and smooth *link function* $g: \mathcal{M} \to \mathbb{R}$ such that we can express (5.3) by the following function (we drop the index $i$)

$$\boldsymbol{x} \mapsto g(\mu(\boldsymbol{x})) = g\left(\mathbb{E}\_{\theta(\boldsymbol{x})} \left[ Y \right] \right) = \eta(\boldsymbol{x}) = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = \beta\_0 + \sum\_{j=1}^{q} \beta\_j x\_j. \tag{5.5}$$

Here, $\langle \cdot,\cdot \rangle$ denotes the scalar product in the Euclidean space $\mathbb{R}^{q+1}$, $\theta(\boldsymbol{x}) = h(\mu(\boldsymbol{x}))$ is the resulting canonical parameter (using the canonical link $h = (\kappa')^{-1}$), and $\eta(\boldsymbol{x})$ is the so-called *linear predictor*. After applying a suitable link function $g$, the systematic effects of the random variable $Y$ with features $\boldsymbol{x}$ can be described by a linear predictor $\eta(\boldsymbol{x}) = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle$, linear in the components of $\boldsymbol{x} \in \mathcal{X}$. This gives a particular functional form to (5.3), and the random variables $Y\_1,\ldots,Y\_n$ share a common regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$. Remark that the link function $g$ used in (5.5) can be different from the canonical link $h$ used to calculate $\theta(\boldsymbol{x}) = h(\mu(\boldsymbol{x}))$. We come back to this distinction below.

#### **Summary of** (5.5)


We can either express this GLM regression structure in the dual (mean) parameter space *<sup>M</sup>* or in the effective domain ˚, see Remarks 2.9,

$$\begin{aligned} \boldsymbol{x} &\mapsto \mu(\boldsymbol{x}) = g^{-1}(\eta(\boldsymbol{x})) = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle \in \mathcal{M} \qquad\text{or} \\ \boldsymbol{x} &\mapsto \theta(\boldsymbol{x}) = (h \circ g^{-1})(\eta(\boldsymbol{x})) = (h \circ g^{-1})\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle \in \mathring{\Theta}, \end{aligned}$$

where $(h \circ g^{-1})$ is the composition of the inverse link $g^{-1}$ and the canonical link $h$. For the moment, the link function $g$ is quite general. In practice, the explicit choice needs some care. The right-hand side of (5.5) is defined on the whole real line if at least one component of $\boldsymbol{x}$ is unbounded on both sides. On the other hand, $\mathcal{M}$ and $\mathring{\Theta}$ may be bounded sets. Therefore, the link function $g$ may require some restrictions such that the domain and the range fulfill the necessary constraints. The dimension of $\boldsymbol{\beta}$ should satisfy $1 \le 1+q \le n$; the lower bound provides a null model and the upper bound a saturated model.

## *5.1.3 Link Functions and Feature Engineering*

As link function we choose a strictly monotone and smooth function $g: \mathcal{M} \to \mathbb{R}$ such that we do not have any conflicts in domains and ranges. Besides these requirements, we may want further properties for the link function $g$ and the features $\boldsymbol{x}$. From (5.5) we have

$$\mu(\boldsymbol{x}) = \mathbb{E}\_{\theta(\boldsymbol{x})} \left[ Y \right] = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle. \tag{5.6}$$

Of course, a basic requirement is that the selected features $\boldsymbol{x}$ can appropriately describe the mean of $Y$ by the function in (5.6), see also Fig. 5.1. This may require so-called *feature engineering* of $\boldsymbol{x}$; for instance, we may want to replace the first component $x\_1$ of the *raw features* $\boldsymbol{x}$ by, say, $x\_1^2$ in the *pre-processed features*. For example, if this first component describes the age of the insurance policyholder, then, in some regression problems, it might be more appropriate to consider $\text{age}^2$ instead of age to bring the predictive problem into structure (5.6). It may also be that we would like to enforce a certain type of *interaction* between the components of the raw features. For instance, we may include in a pre-processed feature a component $x\_1/x\_2^2$, which might correspond to $\text{weight}/\text{height}^2$ if the policyholder has body weight $x\_1$ and body height $x\_2$. In fact, this pre-processed feature is exactly the body mass index of the policyholder. We will come back to feature engineering in Sect. 5.2.2, below.
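The pre-processing described above can be sketched as follows (our own illustration with hypothetical raw feature values):

```python
import numpy as np

# hypothetical raw features: age (years), weight (kg), height (m)
raw = np.array([
    [25.0, 70.0, 1.80],
    [47.0, 85.0, 1.75],
    [63.0, 60.0, 1.62],
])
age, weight, height = raw[:, 0], raw[:, 1], raw[:, 2]

# pre-processed features: intercept x_0 = 1, age, age^2, and the
# interaction weight/height^2 (the body mass index)
X = np.column_stack([
    np.ones(len(raw)),      # intercept / bias component x_0 = 1
    age,
    age ** 2,
    weight / height ** 2,   # BMI
])
```

Each row of `X` is one pre-processed feature vector, so stacking them reproduces the design matrix of (5.4) for the engineered features.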

Another important requirement is the ability to interpret the model. In insurance pricing problems, one often prefers additive and multiplicative effects in the feature components. Choosing the identity link $g(m) = m$ we receive a model with additive effects

$$\mu(\boldsymbol{x}) = \mathbb{E}\_{\theta(\boldsymbol{x})} \left[ Y \right] = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = \beta\_0 + \sum\_{j=1}^{q} \beta\_j x\_j,$$

and choosing the log-link *g(m)* = log*(m)* we receive a model with multiplicative effects

$$\mu(\boldsymbol{x}) = \mathbb{E}\_{\theta(\boldsymbol{x})} \left[ Y \right] = \exp \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = e^{\beta\_0} \prod\_{j=1}^{q} e^{\beta\_j x\_j}.$$

The latter is probably the most commonly used GLM in insurance pricing because it leads to explainable tariffs where feature values directly relate to price decreases and increases in percentages of a base premium $\exp\{\beta\_0\}$.
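Under the log-link, each fitted coefficient translates into a multiplicative relativity on the base premium. A small sketch (the coefficient values are hypothetical):

```python
import numpy as np

# hypothetical fitted regression parameter (beta_0, beta_1, beta_2)
beta = np.array([np.log(500.0), 0.20, -0.10])

x = np.array([1.0, 1.0, 1.0])  # intercept plus two feature components
mu = np.exp(beta @ x)          # expected claim: 500 * e^0.20 * e^-0.10

base_premium = np.exp(beta[0])   # exp(beta_0) = 500
relativities = np.exp(beta[1:])  # multiplicative de-/increases per feature
```

Here the first feature increases the premium by roughly 22% and the second decreases it by roughly 10%, which is the kind of tariff statement practitioners communicate.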

Another very popular choice is the canonical (natural) link, i.e., $g = h = (\kappa')^{-1}$. The canonical link substantially simplifies the analysis and it has very favorable statistical properties (as we will see below). However, in some applications practical needs overrule good statistical properties. Under the canonical link $g = h$ we have, in the dual mean parameter space $\mathcal{M}$ and in the effective domain $\Theta$, respectively,

$$\boldsymbol{x} \mapsto \mu(\boldsymbol{x}) = \kappa'(\eta(\boldsymbol{x})) = \kappa'\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle \qquad \text{and} \qquad \boldsymbol{x} \mapsto \theta(\boldsymbol{x}) = \eta(\boldsymbol{x}) = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle.$$

Thus, the linear predictor $\eta$ and the canonical parameter $\theta$ coincide under the canonical link choice $g = h = (\kappa')^{-1}$.

## *5.1.4 Log-Likelihood Function and Maximum Likelihood Estimation*

Having fully specified a GLM within the EDF, there remains the estimation of the regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$. This is done within the framework of MLE.

The log-likelihood function of $\boldsymbol{Y} = (Y\_1,\ldots,Y\_n)^\top$ for regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$ is given by (see (5.2); we use the independence between the $Y\_i$'s)


$$\boldsymbol{\beta} \mapsto \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \sum\_{i=1}^n \frac{v\_i}{\varphi} \left[ Y\_i h(\mu(\boldsymbol{x}\_i)) - \kappa \left( h(\mu(\boldsymbol{x}\_i)) \right) \right] + a(Y\_i; v\_i/\varphi), \tag{5.7}$$

where we set $\mu(\boldsymbol{x}\_i) = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle$. For the canonical link $g = h = (\kappa')^{-1}$ this simplifies to

$$\boldsymbol{\beta} \mapsto \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \sum\_{i=1}^n \frac{v\_i}{\varphi} \left[ Y\_i \langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle - \kappa \langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle \right] + a(Y\_i; v\_i / \varphi). \tag{5.8}$$

The MLE of $\boldsymbol{\beta}$ requires maximization of the log-likelihoods (5.7) and (5.8), respectively; these are the GLM counterparts to the homogeneous case treated in Sect. 3.3.2. We calculate the score, setting $\eta\_i = \langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle$ and $\mu\_i = \mu(\boldsymbol{x}\_i) = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle$,

$$\begin{split} s(\boldsymbol{\beta}, \boldsymbol{Y}) &= \nabla\_{\boldsymbol{\beta}} \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \sum\_{i=1}^{n} \frac{v\_{i}}{\varphi} \left[ Y\_{i} - \mu\_{i} \right] \nabla\_{\boldsymbol{\beta}} h(\mu(\boldsymbol{x}\_{i})) \\ &= \sum\_{i=1}^{n} \frac{v\_{i}}{\varphi} \left[ Y\_{i} - \mu\_{i} \right] \frac{\partial h(\mu\_{i})}{\partial \mu\_{i}} \frac{\partial \mu\_{i}}{\partial \eta\_{i}} \nabla\_{\boldsymbol{\beta}} \eta(\boldsymbol{x}\_{i}) \\ &= \sum\_{i=1}^{n} \frac{v\_{i}}{\varphi} \frac{Y\_{i} - \mu\_{i}}{V(\mu\_{i})} \left( \frac{\partial g(\mu\_{i})}{\partial \mu\_{i}} \right)^{-1} \boldsymbol{x}\_{i}, \end{split} \tag{5.9}$$

where we use the definition of the variance function $V(\mu) = (\kappa'' \circ h)(\mu)$, see Corollary 2.14. We define the diagonal working weight matrix, which in general depends on $\boldsymbol{\beta}$ through the means $\mu\_i = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle$,

$$W(\boldsymbol{\beta}) = \operatorname{diag}\left(\left(\frac{\partial g(\mu\_i)}{\partial \mu\_i}\right)^{-2} \frac{v\_i}{\varphi} \frac{1}{V(\mu\_i)}\right)\_{1 \le i \le n} \in \mathbb{R}^{n \times n},$$

and the working residuals

$$R = R(\boldsymbol{Y}, \boldsymbol{\beta}) = \left(\frac{\partial g(\mu\_i)}{\partial \mu\_i} (Y\_i - \mu\_i)\right)\_{1 \le i \le n}^{\top} \in \mathbb{R}^n.$$

This allows us to write the score equations in a compact form, which provides the following proposition.

**Proposition 5.1** *The MLE for β is found by solving the score equations*

$$s(\boldsymbol{\beta}, Y) = \nabla\_{\boldsymbol{\beta}} \ell\_Y(\boldsymbol{\beta}) = \mathfrak{X}^{\top} W(\boldsymbol{\beta}) R(Y, \boldsymbol{\beta}) = 0.$$

*For the canonical link $g = h = (\kappa')^{-1}$ the score equations simplify to*

$$s(\boldsymbol{\beta}, \boldsymbol{Y}) = \nabla\_{\boldsymbol{\beta}} \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \mathfrak{X}^{\top} \operatorname{diag}\left( \frac{v\_i}{\varphi} \right)\_{1 \le i \le n} \left( \boldsymbol{Y} - \kappa'(\mathfrak{X}\boldsymbol{\beta}) \right) = 0,$$

*where $\kappa'(\mathfrak{X}\boldsymbol{\beta}) \in \mathbb{R}^n$ is understood element-wise.*

#### *Remarks 5.2*


Similarly to Remarks 3.17, we can calculate Fisher's information matrix w.r.t. $\boldsymbol{\beta}$ through the negative expected Hessian of $\ell\_{\boldsymbol{Y}}(\boldsymbol{\beta})$.

We get Fisher's information matrix w.r.t. *β*

$$\mathcal{I}(\boldsymbol{\beta}) = \mathbb{E}\_{\boldsymbol{\beta}} \left[ \nabla\_{\boldsymbol{\beta}} \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta}) \left( \nabla\_{\boldsymbol{\beta}} \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta}) \right)^\top \right] = -\mathbb{E}\_{\boldsymbol{\beta}} \left[ \nabla\_{\boldsymbol{\beta}}^2 \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta}) \right] = \mathfrak{X}^\top W(\boldsymbol{\beta}) \mathfrak{X}. \tag{5.10}$$

If the design matrix $\mathfrak{X} \in \mathbb{R}^{n \times (q+1)}$ has full rank $q+1 \le n$, Fisher's information matrix $\mathcal{I}(\boldsymbol{\beta})$ is positive definite.

The dispersion parameter $\varphi > 0$ has been treated as a nuisance parameter above. Its explicit specification does not influence the MLE of $\boldsymbol{\beta}$ because it cancels in the score equations. If necessary, we can also estimate this dispersion parameter with MLE. This requires solving the additional score equation

$$\frac{\partial}{\partial \varphi} \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta}, \varphi) = \sum\_{i=1}^{n} -\frac{v\_i}{\varphi^2} \left[ Y\_i h(\mu(\boldsymbol{x}\_i)) - \kappa \left( h(\mu(\boldsymbol{x}\_i)) \right) \right] + \frac{\partial}{\partial \varphi} a(Y\_i; v\_i/\varphi) = 0,\tag{5.11}$$

and we can plug in the MLE of $\boldsymbol{\beta}$ (which can be estimated independently of $\varphi$). Fisher's information matrix in this extended framework is given by

$$\mathcal{I}(\boldsymbol{\beta}, \varphi) = -\mathbb{E}_{\boldsymbol{\beta}}\left[\nabla^2_{(\boldsymbol{\beta},\varphi)} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}, \varphi)\right] = \begin{pmatrix} \mathfrak{X}^{\top} W(\boldsymbol{\beta})\, \mathfrak{X} & 0 \\ 0 & -\mathbb{E}_{\boldsymbol{\beta}}\left[\partial^2 \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}, \varphi)/\partial \varphi^2\right] \end{pmatrix},$$

that is, the off-diagonal terms between $\boldsymbol{\beta}$ and $\varphi$ are zero.

In view of Proposition 5.1 we need a root search algorithm to obtain the MLE of $\boldsymbol{\beta}$. Typically, one uses Fisher's scoring method or the iteratively re-weighted least squares (IRLS) algorithm for this root search problem. This is a main result of the seminal work of Nelder–Wedderburn [283], and it explains the popularity of GLMs, namely, that GLMs can be fitted efficiently by this algorithm. Fisher's scoring method/the IRLS algorithm iterates the following updates for $t \ge 0$ until convergence

$$\widehat{\boldsymbol{\beta}}^{(t)} \mapsto \widehat{\boldsymbol{\beta}}^{(t+1)} = \left(\mathfrak{X}^{\top} W(\widehat{\boldsymbol{\beta}}^{(t)})\, \mathfrak{X}\right)^{-1} \mathfrak{X}^{\top} W(\widehat{\boldsymbol{\beta}}^{(t)}) \left(\mathfrak{X}\widehat{\boldsymbol{\beta}}^{(t)} + \boldsymbol{R}(\boldsymbol{Y}, \widehat{\boldsymbol{\beta}}^{(t)})\right), \tag{5.12}$$

where all terms on the right-hand side are evaluated at algorithmic time $t$. If we have $n$ observations $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^{\top}$ we can estimate at most $n$ parameters. Therefore, in our GLM we assume a regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$ of dimension $q+1 \le n$. Moreover, we require the design matrix $\mathfrak{X}$ to have full rank $q+1 \le n$; otherwise the regression parameter is not uniquely identifiable, since linear dependence among the columns of $\mathfrak{X}$ would allow us to reduce the parameter space to a lower-dimensional representation. Full rank is also needed to calculate the inverse matrix in (5.12). This motivates the following assumption.
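To make the iteration (5.12) concrete, here is a minimal sketch for a Poisson GLM with canonical log-link, assuming unit exposures $v_i = 1$ and dispersion $\varphi = 1$, so that $W(\boldsymbol{\beta}) = \operatorname{diag}(\mu_i)$ and $R_i = (Y_i - \mu_i)/\mu_i$; the simulated design matrix and the true parameter below are illustrative assumptions, not from the text.

```python
import numpy as np

# Illustrative simulated data: Poisson GLM with canonical log-link,
# unit exposures v_i = 1 and dispersion phi = 1.
rng = np.random.default_rng(0)
n, q = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])  # design matrix with intercept
beta_true = np.array([0.1, 0.3, -0.2, 0.5])
Y = rng.poisson(np.exp(X @ beta_true))

def irls_poisson(X, Y, n_iter=25):
    """IRLS update (5.12): working weights W = diag(mu_i) and working
    residuals R_i = (Y_i - mu_i)/mu_i under the canonical log-link."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        Z = X @ beta + (Y - mu) / mu                # adjusted linearized response
        beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * Z))
    return beta

beta_hat = irls_poisson(X, Y)
score = X.T @ (Y - np.exp(X @ beta_hat))            # canonical-link score equations
print(np.max(np.abs(score)))                        # essentially zero at the MLE
```

At convergence the score equations of Proposition 5.1 are satisfied up to machine precision, which is the stopping criterion one would use in practice.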

**Assumption 5.3** *Throughout, we assume that the design matrix $\mathfrak{X} \in \mathbb{R}^{n\times(q+1)}$ has full rank $q+1 \le n$.*

#### *Remarks 5.4 (Justification of Fisher's Scoring Method/IRLS Algorithm)*

• We give a short justification of Fisher's scoring method/the IRLS algorithm; for a more detailed treatment we refer to Section 2.5 in McCullagh–Nelder [265] and Section 2.2 in Fahrmeir–Tutz [123].

The Newton–Raphson algorithm provides a numerical scheme to find solutions of the score equations. It requires to iterate for $t \ge 0$

$$\widehat{\boldsymbol{\beta}}^{(t)} \mapsto \widehat{\boldsymbol{\beta}}^{(t+1)} = \widehat{\boldsymbol{\beta}}^{(t)} + \widehat{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}\, s(\widehat{\boldsymbol{\beta}}^{(t)}, \boldsymbol{Y}),$$

where $\widehat{\mathcal{I}}(\boldsymbol{\beta}) = -\nabla^2_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta})$ denotes the observed information matrix in $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$. The calculation of the inverse of the observed information matrix $\widehat{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}$ can be time-consuming and unstable, because we need to calculate second derivatives and because the eigenvalues of the observed information matrix can be close to zero. A stable scheme is obtained by replacing the observed information matrix $\widehat{\mathcal{I}}(\boldsymbol{\beta})$ by Fisher's information matrix $\mathcal{I}(\boldsymbol{\beta}) = \mathbb{E}_{\boldsymbol{\beta}}[\widehat{\mathcal{I}}(\boldsymbol{\beta})]$, which is positive definite under Assumption 5.3; this provides a quasi-Newton method. Thus, for Fisher's scoring method we iterate for $t \ge 0$

$$\widehat{\boldsymbol{\beta}}^{(t)} \mapsto \widehat{\boldsymbol{\beta}}^{(t+1)} = \widehat{\boldsymbol{\beta}}^{(t)} + \mathcal{I}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}\, s(\widehat{\boldsymbol{\beta}}^{(t)}, \boldsymbol{Y}), \tag{5.13}$$

and rewriting this provides us exactly with (5.12). The latter can also be interpreted as an IRLS scheme where the response $g(Y_i)$ is replaced by an adjusted linearized version $Z_i = g(\mu_i) + \frac{\partial g(\mu_i)}{\partial \mu_i}(Y_i - \mu_i)$. This corresponds to the last bracket in (5.12), with the corresponding working weights.

• Under the canonical link choice, Fisher's information matrix and the observed information matrix coincide, i.e., $\mathcal{I}(\boldsymbol{\beta}) = \widehat{\mathcal{I}}(\boldsymbol{\beta})$, and the Newton–Raphson algorithm, Fisher's scoring method and the IRLS algorithm are identical. This can easily be seen from Proposition 5.1. Under the canonical link choice we receive

$$\nabla^2_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = -\widehat{\mathcal{I}}(\boldsymbol{\beta}) = -\mathfrak{X}^{\top} \operatorname{diag}\left(\frac{v_i}{\varphi}\, V(\mu_i)\right)_{1 \le i \le n} \mathfrak{X} = -\mathfrak{X}^{\top} W(\boldsymbol{\beta})\, \mathfrak{X} = -\mathcal{I}(\boldsymbol{\beta}). \tag{5.14}$$

The full rank assumption $q+1 \le n$ on the design matrix $\mathfrak{X}$ implies that Fisher's information matrix $\mathcal{I}(\boldsymbol{\beta})$ is positive definite. This in turn implies that the log-likelihood function $\ell_{\boldsymbol{Y}}(\boldsymbol{\beta})$ is strictly concave, providing uniqueness of a critical point (provided it exists). This indicates that the canonical link has very favorable properties for MLE. Examples 5.5 and 5.6 give two examples not using the canonical link; the first one is a concave maximization problem, the second one is not for $p > 2$.

*Example 5.5 (Gamma Model with Log-Link)* We study the gamma distribution as a single-parameter EDF model, choosing the shape parameter $\alpha = 1/\varphi$ as the inverse of the dispersion parameter, see Sect. 2.2.2. The cumulant function $\kappa(\theta) = -\log(-\theta)$ gives us the canonical link $\theta = h(\mu) = -1/\mu$. Moreover, we choose the log-link $\eta = g(\mu) = \log(\mu)$ for the GLM. This gives the canonical parameter $\theta = -\exp\{-\eta\}$. We receive the score

$$s(\boldsymbol{\beta}, \boldsymbol{Y}) = \nabla_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \frac{v_i}{\varphi}\left[\frac{Y_i}{\mu_i} - 1\right] \boldsymbol{x}_i = \mathfrak{X}^{\top} \operatorname{diag}\left(\frac{v_i}{\varphi}\right)_{1 \le i \le n} \boldsymbol{R}(\boldsymbol{Y}, \boldsymbol{\beta}).$$

Unlike in other examples with non-canonical links, we receive a favorable expression here because only one term in the square bracket depends on the regression parameter $\boldsymbol{\beta}$, or equivalently, the working weight matrix $W$ does not depend on $\boldsymbol{\beta}$. We calculate the negative Hessian (observed information matrix)

$$\widehat{\mathcal{I}}(\boldsymbol{\beta}) = -\nabla^2_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \mathfrak{X}^{\top} \operatorname{diag}\left(\frac{v_i}{\varphi}\frac{Y_i}{\mu_i}\right)_{1 \le i \le n} \mathfrak{X}.$$

In the gamma model all observations $Y_i$ are strictly positive, a.s., and under the full rank assumption $q+1 \le n$, the observed information matrix $\widehat{\mathcal{I}}(\boldsymbol{\beta})$ is positive definite; thus, we have a strictly concave log-likelihood function in the gamma case with log-link. ■

*Example 5.6 (Tweedie's Models with Log-Link)* We study Tweedie's models for power variance parameters $p > 1$ as single-parameter EDF models, see Sect. 2.2.3. The cumulant function $\kappa_p$ is given in Table 4.1. This gives us the canonical link $\theta = h_p(\mu) = \mu^{1-p}/(1-p) < 0$ for $\mu > 0$ and $p > 1$. Moreover, we choose the log-link $\eta = g(\mu) = \log(\mu)$ for the GLM. This implies $\theta = \exp\{(1-p)\eta\}/(1-p) < 0$ for $p > 1$. We receive the score

$$s(\boldsymbol{\beta}, \boldsymbol{Y}) = \nabla_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \frac{v_i}{\varphi}\frac{Y_i - \mu_i}{\mu_i^{p-1}}\, \boldsymbol{x}_i = \mathfrak{X}^{\top} \operatorname{diag}\left(\frac{v_i}{\varphi}\frac{1}{\mu_i^{p-2}}\right)_{1 \le i \le n} \boldsymbol{R}(\boldsymbol{Y}, \boldsymbol{\beta}).$$

We calculate the negative Hessian (observed information matrix) for *μi >* 0

$$\widehat{\mathcal{I}}(\boldsymbol{\beta}) = -\nabla^2_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \mathfrak{X}^{\top} \operatorname{diag}\left(\frac{v_i}{\varphi}\frac{(p-1)Y_i - (p-2)\mu_i}{\mu_i^{p-1}}\right)_{1 \le i \le n} \mathfrak{X}.$$

This matrix is positive definite for $p \in [1, 2]$, and for $p > 2$ it is not positive definite because $(p-1)Y_i - (p-2)\mu_i$ may take positive or negative values if we vary $\mu_i > 0$ over its domain $\mathcal{M}$. Thus, we do not have concavity of the optimization problem under the log-link choice in Tweedie's GLMs for power variance parameters $p > 2$. This in particular applies to the inverse Gaussian GLM with log-link. ■
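The sign change of $(p-1)Y_i - (p-2)\mu_i$ for $p > 2$ is easy to exhibit numerically; for $p = 3$ (the inverse Gaussian case) and a single observation $Y = 1$, the diagonal term of the observed information flips sign as $\mu$ varies (the numbers below are illustrative):

```python
import numpy as np

# Illustrative numbers: p = 3 (inverse Gaussian case), one observation Y = 1.
p, Y = 3.0, 1.0
mu = np.array([0.5, 4.0])            # two candidate mean values mu > 0
h = (p - 1) * Y - (p - 2) * mu       # numerator of the diagonal term above
print(h)                             # one positive and one negative entry
```

Since the diagonal entries can take either sign, the Hessian cannot be negative semi-definite over the whole mean domain, confirming the lack of concavity.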

## *5.1.5 Balance Property Under the Canonical Link Choice*

Throughout this section we work under the canonical link choice $g = h = (\kappa')^{-1}$. This choice has very favorable statistical properties. We have already seen in Remarks 5.4 that the derivation of the MLE of $\boldsymbol{\beta}$ becomes particularly easy under the canonical link choice, and that the observed information matrix $\widehat{\mathcal{I}}(\boldsymbol{\beta})$ coincides with Fisher's information matrix $\mathcal{I}(\boldsymbol{\beta})$ in this case, see (5.14).

For insurance pricing, canonical links have another very remarkable property, namely, that the estimated model automatically fulfills the balance property and, hence, is unbiased at the portfolio level. This is particularly important in insurance pricing because it tells us that the insurance prices (over the entire portfolio) are on the right level. We have already met the balance property in Corollary 3.19.

**Corollary 5.7 (Balance Property)** *Assume that $\boldsymbol{Y}$ has independent components being modeled by a GLM under the canonical link choice $g = h = (\kappa')^{-1}$. Assume that the MLE of the regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$ exists and denote it by $\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}$. We have the balance property on portfolio level (for constant dispersion $\varphi$)*

$$\sum_{i=1}^{n} \mathbb{E}_{\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}}\left[v_i Y_i\right] = \sum_{i=1}^{n} v_i\, \kappa'\langle\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}, \boldsymbol{x}_i\rangle = \sum_{i=1}^{n} v_i Y_i.$$

*Proof* The first column of the design matrix $\mathfrak{X}$ is identically equal to 1, representing the intercept, see (5.4). The second part of Proposition 5.1, applied to this first column of $\mathfrak{X}$ and canceling the (constant) dispersion $\varphi$, provides

$$(1, \ldots, 1)\, \operatorname{diag}(v_1, \ldots, v_n)\, \kappa'(\mathfrak{X}\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}) = (1, \ldots, 1)\, \operatorname{diag}(v_1, \ldots, v_n)\, \boldsymbol{Y}.$$

This proves the claim. □

*Remark 5.8* We mention once more that this balance property is very strong and useful, see also Remarks 3.20. In particular, the balance property holds even if the chosen GLM is completely misspecified. Misspecification may include an incorrect distributional model, a wrong link function choice, or inappropriately pre-processed features, etc. Such misspecification will imply that we have a poor model on the insurance policy level (observation level). However, the total premium charged over the entire portfolio will be on the right level (provided that the structure of the portfolio does not change) because it matches the observations, and hence we have unbiasedness for the portfolio mean.
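The balance property can be verified numerically; the following sketch assumes a simulated portfolio with a Poisson GLM under the canonical log-link, unit exposures and unit dispersion, fitted by Fisher's scoring. After fitting, the fitted total matches the observed total exactly, irrespective of whether the model is correctly specified.

```python
import numpy as np

# Illustrative simulated portfolio: Poisson GLM with canonical log-link,
# unit exposures v_i = 1 and dispersion phi = 1.
rng = np.random.default_rng(7)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = rng.poisson(np.exp(X @ np.array([0.0, 0.5, -0.3])))

beta = np.zeros(3)
for _ in range(25):                   # Fisher's scoring / IRLS, see (5.12)
    mu = np.exp(X @ beta)
    Z = X @ beta + (Y - mu) / mu      # adjusted linearized response
    beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * Z))

fitted = np.exp(X @ beta)
print(fitted.sum(), Y.sum())          # the two portfolio totals coincide
```

The identity is exactly the first score equation (the intercept column) of Proposition 5.1 evaluated at the MLE.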

From the log-likelihood function (5.8) we see that under the canonical link choice we consider the statistic $S(\boldsymbol{Y}) = \mathfrak{X}^{\top}\operatorname{diag}(v_i/\varphi)_{1\le i\le n}\, \boldsymbol{Y} \in \mathbb{R}^{q+1}$, and to prove the balance property we have used the first component of this statistic. Considering all components, $S(\boldsymbol{Y})$ is an unbiased estimator (decision rule) for

$$\mathbb{E}_{\boldsymbol{\beta}}\left[S(\boldsymbol{Y})\right] = \mathfrak{X}^{\top}\operatorname{diag}(v_i/\varphi)_{1\le i\le n}\, \kappa'(\mathfrak{X}\boldsymbol{\beta}) = \left(\sum_{i=1}^{n}\frac{v_i}{\varphi}\,\kappa'\langle\boldsymbol{\beta}, \boldsymbol{x}_i\rangle\, x_{i,j}\right)^{\top}_{0\le j\le q}. \tag{5.15}$$

This unbiased estimator $S(\boldsymbol{Y})$ meets the Cramér–Rao information bound, hence it is UMVU: taking the partial derivatives of the previous expression gives $\nabla_{\boldsymbol{\beta}}\, \mathbb{E}_{\boldsymbol{\beta}}[S(\boldsymbol{Y})] = \mathcal{I}(\boldsymbol{\beta})$, the latter also being the multivariate Cramér–Rao information bound for the unbiased decision rule $S(\boldsymbol{Y})$ for (5.15). Focusing on the first component we have

$$\operatorname{Var}_{\boldsymbol{\beta}}\left(\sum_{i=1}^{n} \mathbb{E}_{\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}}\left[v_i Y_i\right]\right) = \operatorname{Var}_{\boldsymbol{\beta}}\left(\sum_{i=1}^{n} v_i Y_i\right) = \sum_{i=1}^{n} \varphi\, v_i\, V(\mu_i) = \varphi^2\left(\mathcal{I}(\boldsymbol{\beta})\right)_{0,0}, \tag{5.16}$$

where the component $(0,0)$ in the last expression is the top-left entry of Fisher's information matrix $\mathcal{I}(\boldsymbol{\beta})$ under the canonical link choice.

## *5.1.6 Asymptotic Normality*

Formula (5.16) quantifies the uncertainty in the premium calculation of the insurance policies if we use the MLE estimated model (under the canonical link choice). That is, it quantifies the uncertainty in the dual mean parametrization in terms of the resulting variance. We could also focus on the MLE $\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}$ itself (for a general link function $g$). In general, this MLE is not unbiased, but we have consistency and asymptotic normality similar to Theorem 3.28. Under "certain regularity conditions"<sup>1</sup> we have for $n$ large

$$\widehat{\boldsymbol{\beta}}_n^{\mathrm{MLE}} \overset{(d)}{\approx} \mathcal{N}\left(\boldsymbol{\beta},\, \mathcal{I}_n(\boldsymbol{\beta})^{-1}\right), \tag{5.17}$$

where $\widehat{\boldsymbol{\beta}}_n^{\mathrm{MLE}}$ is the MLE based on the observations $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)^{\top}$, and $\mathcal{I}_n(\boldsymbol{\beta})$ is Fisher's information matrix of $\boldsymbol{Y}_n$, which scales linearly in $n$ in the homogeneous EF case, see Remarks 3.14, and in the homogeneous EDF case it scales as $\sum_{i=1}^{n} v_i$, see (3.25).
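The scaling of (5.17) can be checked numerically. Assuming a Poisson GLM with log-link and unit weights, Fisher's information is $\mathcal{I}_n(\boldsymbol{\beta}) = \mathfrak{X}^{\top}W(\boldsymbol{\beta})\mathfrak{X}$ with $W(\boldsymbol{\beta}) = \operatorname{diag}(\mu_i)$; replicating the design four times quadruples the information and halves the asymptotic standard errors (design and parameter below are illustrative):

```python
import numpy as np

# Illustrative design and parameter; Poisson GLM with log-link and
# unit weights, so W(beta) = diag(mu_i).
rng = np.random.default_rng(2)
beta = np.array([0.1, 0.4])

def fisher_info(X, beta):
    """Fisher's information matrix X^T W(beta) X, see (5.10)."""
    mu = np.exp(X @ beta)
    return X.T @ (mu[:, None] * X)

X1 = np.column_stack([np.ones(1000), rng.normal(size=1000)])
X4 = np.vstack([X1] * 4)              # replicated design: I_{4n} = 4 I_n
se1 = np.sqrt(np.diag(np.linalg.inv(fisher_info(X1, beta))))
se4 = np.sqrt(np.diag(np.linalg.inv(fisher_info(X4, beta))))
print(se1 / se4)                      # ratio 2: standard errors shrink like 1/sqrt(n)
```

In practice one plugs $\widehat{\boldsymbol{\beta}}_n^{\mathrm{MLE}}$ into $\mathcal{I}_n(\cdot)$ to obtain estimated standard errors from the diagonal of the inverse.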

## *5.1.7 Maximum Likelihood Estimation and Unit Deviances*

From formula (5.7) we conclude that the MLE $\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}$ of $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$ is found as the solution of (subject to existence)

$$\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}} = \underset{\boldsymbol{\beta}}{\arg\max}\; \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \underset{\boldsymbol{\beta}}{\arg\max} \sum_{i=1}^{n} \frac{v_i}{\varphi}\Big[Y_i\, h(\mu(\boldsymbol{x}_i)) - \kappa\left(h(\mu(\boldsymbol{x}_i))\right)\Big],$$

with $\mu_i = \mu(\boldsymbol{x}_i) = \mathbb{E}_{\theta(\boldsymbol{x}_i)}[Y] = g^{-1}\langle\boldsymbol{\beta}, \boldsymbol{x}_i\rangle$ under the link choice $g$. If we prefer to work with an objective function that reflects the notion of a loss function, we can work with the unit deviances $\mathfrak{d}(Y_i, \mu_i)$ studied in Sect. 4.1.2. The MLE is then obtained by, see (4.20)–(4.21),

$$\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}} = \underset{\boldsymbol{\beta}}{\arg\max}\; \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \underset{\boldsymbol{\beta}}{\arg\min} \sum_{i=1}^{n} \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i), \tag{5.18}$$

the latter satisfying $\mathfrak{d}(Y_i, \mu_i) \ge 0$ for all $1 \le i \le n$, with equality to zero if and only if $Y_i = \mu_i$, see Lemma 2.22. Thus, using the unit deviances we have a loss function that is bounded below by zero, and we determine the regression parameter $\boldsymbol{\beta}$ such that this loss is (in-sample) minimized. This can also be interpreted in a more geometric way. Consider the $(q+1)$-dimensional manifold $\mathcal{M} \subset \mathbb{R}^n$ spanned by the GLM function

$$\boldsymbol{\beta} \mapsto \boldsymbol{\mu}(\boldsymbol{\beta}) = g^{-1}(\mathfrak{X}\boldsymbol{\beta}) = \left(g^{-1}\langle\boldsymbol{\beta}, \boldsymbol{x}_1\rangle, \ldots, g^{-1}\langle\boldsymbol{\beta}, \boldsymbol{x}_n\rangle\right)^{\top} \in \mathbb{R}^n. \tag{5.19}$$

<sup>1</sup> The regularity conditions for asymptotic normality results depend on the particular regression problem studied; we refer to pages 43–44 in Fahrmeir–Tutz [123].

Minimization (5.18) then tries to find the point $\boldsymbol{\mu}(\boldsymbol{\beta})$ in this manifold $\mathcal{M} \subset \mathbb{R}^n$ that minimizes simultaneously all unit deviances $\mathfrak{d}(Y_i, \cdot)$ w.r.t. the observation $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^{\top} \in \mathbb{R}^n$. In other words, the optimal parameter $\boldsymbol{\beta}$ is obtained by "projecting" the observation $\boldsymbol{Y}$ onto this manifold $\mathcal{M}$, where "projection" is understood as a simultaneous minimization of the loss function $\sum_{i=1}^{n} \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i)$, see Fig. 5.2. In the un-weighted Gaussian case, this corresponds to the usual orthogonal projection, as the next example shows, and in the non-Gaussian case it is understood in the KL divergence minimization sense, as displayed in formula (4.11).

*Example 5.9 (Gaussian Case)* Assume we are in the Gaussian EDF case with cumulant function $\kappa(\theta) = \theta^2/2$ and canonical link $g(\mu) = h(\mu) = \mu$. In this case, the manifold (5.19) is the linear space spanned by the columns of the design matrix $\mathfrak{X}$

$$\boldsymbol{\beta} \mapsto \boldsymbol{\mu}(\boldsymbol{\beta}) = \mathfrak{X}\boldsymbol{\beta} = \left(\langle\boldsymbol{\beta}, \boldsymbol{x}_1\rangle, \ldots, \langle\boldsymbol{\beta}, \boldsymbol{x}_n\rangle\right)^{\top} \in \mathbb{R}^n.$$

If additionally we assume $v_i/\varphi = c > 0$ for all $1 \le i \le n$, the minimization problem (5.18) reads as

$$\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}} = \underset{\boldsymbol{\beta}}{\arg\min} \sum_{i=1}^{n} \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i) = \underset{\boldsymbol{\beta}}{\arg\min}\; \|\boldsymbol{Y} - \mathfrak{X}\boldsymbol{\beta}\|_2^2,$$

where we have used that the unit deviances in the Gaussian case are given by the square loss function, see Example 4.12. As a consequence, the MLE $\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}$ is found by orthogonally projecting $\boldsymbol{Y}$ onto $\mathcal{M} = \{\mathfrak{X}\boldsymbol{\beta}\,|\,\boldsymbol{\beta} \in \mathbb{R}^{q+1}\} \subset \mathbb{R}^n$, and this orthogonal projection is given by $\mathfrak{X}\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}} \in \mathcal{M}$. ■
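This orthogonal projection view can be verified numerically: the least squares residual $\boldsymbol{Y} - \mathfrak{X}\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}$ is orthogonal to the columns of $\mathfrak{X}$ (simulated illustrative data, constant weights $v_i/\varphi$ as above):

```python
import numpy as np

# Illustrative simulated data; un-weighted Gaussian case.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
Y = rng.normal(size=50)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least squares MLE
proj = X @ beta_hat                                # orthogonal projection onto M
print(np.max(np.abs(X.T @ (Y - proj))))            # residual orthogonal to columns of X
```

The printed value is numerically zero, which is exactly the normal-equations characterization of the orthogonal projection.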

## **5.2 Actuarial Applications of Generalized Linear Models**

The purpose of this section is to illustrate how the concept of GLMs is used in actuarial modeling. We therefore explore the typical actuarial examples of claim counts and claim size modeling.

## *5.2.1 Selection of a Generalized Linear Model*

The selection of a predictive model within GLMs for solving an applied actuarial problem requires the following choices.

**Choice of the Member of the EDF** Select a member of the EDF that fits the modeling problem. In a first step, we should try to understand the properties of the data *Y* before doing this selection, for instance, do we have count data, do we have a classification problem, do we have continuous observations?

All members of the EDF are light-tailed because the moment generating function exists around the origin, see Corollary 2.14; hence, the EDF is not suited to model heavy-tailed data, for instance, data with a regularly varying tail. Therefore, a datum $Y$ is sometimes first transformed before being modeled by a member of the EDF. A popular transformation is the logarithm for positive observations. After this transformation, a member of the EDF can be chosen to model $\log(Y)$. For instance, if we choose the Gaussian distribution for $\log(Y)$, then $Y$ is log-normally distributed, and if we choose the exponential distribution for $\log(Y)$, then $Y$ is Pareto distributed, see Sect. 2.2.5. One can then model the transformed datum with a GLM. Often this provides very accurate models, say, on the log scale for the log-transformed data. There is one issue with this approach: if a model is unbiased on the transformed scale, then it is typically biased on the original observation scale; if the transformation is concave, this follows easily from Jensen's inequality. The problematic part is that the bias correction itself often has systematic effects, which means that the transformation (or the involved nuisance parameters) should be modeled with a regression model, too, see Sect. 5.3.9. In many cases this will not easily work, unfortunately. Therefore, if possible, clear preference should be given to modeling the data on the original observation scale (if unbiasedness is a central requirement).
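The Jensen-inequality bias just described can be illustrated with a small simulation, assuming (purely for illustration) a standard normal model on the log scale: exponentiating an unbiased log-scale mean underestimates the mean on the original scale.

```python
import numpy as np

# Illustrative: log(Y) standard normal, so Y is log-normally distributed
# with true mean exp(1/2).
rng = np.random.default_rng(4)
logY = rng.normal(0.0, 1.0, size=100_000)
naive = np.exp(logY.mean())        # exp of an unbiased estimate on the log scale
emp_mean = np.exp(logY).mean()     # empirical mean on the original scale
print(naive, emp_mean)             # naive is biased low
```

The gap between the two numbers is the systematic back-transformation bias that would need an explicit correction term.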

**Choice of Link Function** From a statistical point of view we should choose the canonical link $g = h$ to connect the mean $\mu$ of the model to the linear predictor $\eta$, because this implies many favorable mathematical properties. However, as seen, sometimes we have different needs. Practical reasons may require a model with additive or multiplicative effects, which favors the identity link or the log-link, respectively. Another requirement is that the resulting canonical parameter $\theta = (h \circ g^{-1})(\eta)$ needs to lie within the effective domain $\boldsymbol{\Theta}$. If this effective domain is bounded, for instance, if it covers the negative real line as in the gamma model, a (transformation of the) log-link might be more suitable than the canonical link because $g^{-1}(\cdot) = -\exp(\cdot)$ has a strictly negative range, see Example 5.5.

**Choice of Features and Feature Engineering** Assume we have selected the member of the EDF and the link function $g$. This gives us the relationship between the mean $\mu$ and the linear predictor $\eta$, see (5.5),

$$\mu(\boldsymbol{x}) = \mathbb{E}_{\theta(\boldsymbol{x})}\left[Y\right] = g^{-1}(\eta(\boldsymbol{x})) = g^{-1}\langle\boldsymbol{\beta}, \boldsymbol{x}\rangle. \tag{5.20}$$

Thus, the features $\boldsymbol{x} \in \mathcal{X} \subset \mathbb{R}^{q+1}$ need to be in the right functional form so that they can appropriately describe the systematic effects via the function (5.20). We distinguish the following feature types:


All these components need to be brought into a suitable form so that they can be used in a linear predictor $\eta(\boldsymbol{x}) = \langle\boldsymbol{\beta}, \boldsymbol{x}\rangle$, see (5.20). This requires the consideration of the following points: (1) transformation of continuous components so that they can describe the systematic effects in a linear form; (2) transformation of categorical components to real-valued components; (3) interactions of components beyond an additive structure in the linear predictor; and (4) the resulting design matrix $\mathfrak{X}$ should have full rank $q+1 \le n$. We are going to describe these points (1)–(4) in the next section.

## *5.2.2 Feature Engineering*

#### **Categorical Feature Components: Dummy Coding**

Categorical variables need to be embedded into a Euclidean space. This embedding needs to be done such that the resulting design matrix $\mathfrak{X}$ has full rank $q+1 \le n$. There are many different ways to do so, and the particular choice depends on the modeling purpose. The most popular way is *dummy coding*. We only describe dummy coding here because it is sufficient for our purposes, but we mention that there are also other codings like effects coding or Helmert's contrast coding.<sup>2</sup> The choice of the coding does not influence the predictive model (if we work with a full rank design matrix), but it may influence parameter selection, parameter reduction and model interpretation. For instance, the choice of the coding is (more) important in medical studies where one tries to understand the effects of certain therapies.

Assume that the raw feature component $\tilde{x}_j$ is a categorical variable taking $K$ different levels $\{a_1, \ldots, a_K\}$. For dummy coding we declare one level, say $a_K$, to be the reference level, and all other levels are described relative to that reference level. Formally, this can be described by an embedding map

$$\tilde{x}_j \mapsto \boldsymbol{x}_j = \left(\mathbb{1}_{\{\tilde{x}_j = a_1\}}, \ldots, \mathbb{1}_{\{\tilde{x}_j = a_{K-1}\}}\right)^{\top} \in \mathbb{R}^{K-1}. \tag{5.21}$$

This is closely related to the categorical distribution in Sect. 2.1.4. An explicit example is given in Table 5.1.
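The embedding (5.21) can be sketched in a few lines; the helper name `dummy_code` is ours, and the color levels below are illustrative (with brown as reference level, as in Example 5.10):

```python
import numpy as np

def dummy_code(x_raw, levels):
    """Embedding (5.21): the last entry of `levels` is the reference level
    and is mapped to the zero vector. (Helper name is ours.)"""
    return np.array([[1.0 if x == a else 0.0 for a in levels[:-1]]
                     for x in x_raw])

levels = ["red", "blue", "brown"]        # brown = reference level
print(dummy_code(["red", "brown", "blue"], levels))
# rows: red -> (1, 0), brown -> (0, 0), blue -> (0, 1)
```

Each categorical level thus occupies its own column of the design matrix, except for the reference level, which is absorbed into the intercept.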

*Example 5.10 (Multiplicative Model)* If we choose the log-link function $\eta = g(\mu) = \log(\mu)$, we receive the regression function for the categorical example of Table 5.1

$$\tilde{x}_j \mapsto \exp\langle\boldsymbol{\beta}, \boldsymbol{x}_j\rangle = \exp\{\beta_0\} \prod_{k=1}^{K-1} \exp\left\{\beta_k\, \mathbb{1}_{\{\tilde{x}_j = a_k\}}\right\}, \tag{5.22}$$

including an intercept component. Thus, the base value $\exp\{\beta_0\}$ is determined by the reference level $a_K = {}$brown, and any color different from brown has a deviation from the base value described by the multiplicative correction term $\exp\{\beta_k\, \mathbb{1}_{\{\tilde{x}_j = a_k\}}\}$. ■

<sup>2</sup> There is an example of Helmert's contrast coding in Remarks 2.7 of the lecture notes [392], and for more examples we refer to the UCLA statistical consulting website: https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.

#### *Remarks 5.11*


*Example 5.12 (Balance Property and Dummy Coding)* A main argument for the use of the canonical link function has been the fulfillment of the balance property, see Corollary 5.7. If we have categorical feature components and if we apply dummy coding to those, then the balance property is projected down to the individual levels of that categorical variable. Assume that columns 2 to $K$ of the design matrix $\mathfrak{X}$ are used to model a raw categorical feature $\tilde{x}_1$ with $K$ levels according to (5.21). In that case, column $k$, for $2 \le k \le K$, indicates all observations $Y_i$ that belong to level $a_{k-1}$. Analogously to the proof of Corollary 5.7, we receive (the summation index $i$ runs over the different instances/policies)

$$\sum_{i:\, \tilde{x}_{i,1} = a_{k-1}} \mathbb{E}_{\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}}\left[v_i Y_i\right] = \sum_{i=1}^{n} x_{i,k}\, \mathbb{E}_{\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}}\left[v_i Y_i\right] = \sum_{i=1}^{n} x_{i,k}\, v_i Y_i = \sum_{i:\, \tilde{x}_{i,1} = a_{k-1}} v_i Y_i. \tag{5.23}$$

Thus, we receive the balance property for all policies $1 \le i \le n$ that belong to level $a_{k-1}$.

If we have many levels, then it will happen that some levels have only very few observations, and the above summation (5.23) only runs over very few insurance policies with $\tilde{x}_{i,1} = a_{k-1}$. Suppose additionally that the volumes $v_i$ are small. This can lead to considerable estimation uncertainty, because the estimated prices on the left-hand side of (5.23) will be based too much on the individual observations $Y_i$ having the corresponding level, and we are not in the regime of a law of large numbers that balances these observations.

Thus, this balance property from dummy coding is a natural property under the canonical link choice. Actuarial pricing is very familiar with such a property. Early distribution-free approaches have postulated this property, resulting in the method of total marginal sums, see Bailey and Jung [22, 206], where the balance property is enforced for the marginal sums of all categorical levels in parameter estimation. However, if we have scarce levels in categorical variables, this approach needs careful consideration. ■

#### **Binary Feature Components**

Binary feature components do not need a treatment different from the categorical ones; they are Bernoulli variables which can be encoded as 0 or 1. This is exactly dummy coding for $K = 2$ levels.

#### **Continuous Feature Components**

Continuous feature components are already real-valued. Therefore, from the viewpoint of 'variable types', the continuous feature components do not need any pre-processing because they are already in the right format to be included in scalar products.

Nevertheless, in many cases, continuous feature components also need feature engineering because only in rare cases do they directly fit the functional form (5.20). We give an example. Consider car drivers that have different driving experience and different driving skills. To explain experience and skills we typically choose the age of driver as explanatory variable. Modeling the claim frequency as a function of the age of driver, we often observe a U-shaped function, thus, a function that is non-monotone in the age of driver variable. Since the link function $g$ needs to be strictly monotone, this regression problem cannot be modeled by (5.20) with the age of driver directly included as a feature, because this would lead to monotonicity of the regression function in the age of driver variable.

Typically, in such situations, the continuous variable is discretized to categorical classes. In the driver's age example, we build age classes. These age classes are then treated as categorical variables using dummy coding (5.21). We will give examples below. These age classes should fulfill the requirement of being sufficiently homogeneous in the sense that insurance policies that fall into the same class should have a similar propensity to claims. This implies that we would like to have many small homogeneous classes. However, the classes should be sufficiently large, otherwise parameter estimation involves high uncertainty, see also Example 5.12. Thus, there is a trade-off to sufficiently meet both of these two requirements.

A disadvantage of this discretization approach is that neighboring age classes will not be recognized by the regression function because, per se, dummy coding is based on nominal variables not having any topology. This is also illustrated by the fact that all categorical levels (excluding the reference level) have, in view of the embedding (5.21), the same mutual Euclidean distance. Therefore, in some applications, one prefers a different approach by rather trying to find an appropriate functional form. For instance, we can pre-process a strictly positive raw feature component $\tilde{x}_l$ to a higher-dimensional functional form

$$\tilde{x}_l \mapsto \beta_1 \tilde{x}_l + \beta_2 \tilde{x}_l^2 + \beta_3 \tilde{x}_l^3 + \beta_4 \log(\tilde{x}_l), \tag{5.24}$$

with regression parameter $(\beta_1, \ldots, \beta_4)^{\top}$, i.e., we have a polynomial function of degree 3 plus a logarithmic term in this choice. If one does not want to choose a specific functional form, one often chooses natural cubic splines. This, together with regularization, leads to the framework of generalized additive models (GAMs), which is a popular family of regression models besides GLMs; for literature on GAMs we refer to Hastie–Tibshirani [182], Wood [384], Ohlsson–Johansson [290], Denuit et al. [99] and Wüthrich–Buser [392]. In these notes we will not further pursue GAMs.
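The functional form (5.24) amounts to expanding one raw column into four engineered columns of the design matrix; a minimal sketch (the helper name `engineer` is ours):

```python
import numpy as np

def engineer(x):
    """Functional form (5.24): cubic polynomial plus a log term for a
    strictly positive raw feature component. (Helper name is ours.)"""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x, x**2, x**3, np.log(x)])

print(engineer([1.0, 2.0]))   # each row: (x, x^2, x^3, log x)
```

The four engineered columns then each receive their own coefficient $\beta_1, \ldots, \beta_4$ in the linear predictor.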

*Example 5.13 (Multiplicative Model)* If we choose the log-link function $\eta = g(\mu) = \log(\mu)$, we receive a multiplicative regression function

$$\boldsymbol{x} \mapsto \mu(\boldsymbol{x}) = \exp\langle\boldsymbol{\beta}, \boldsymbol{x}\rangle = \exp\{\beta_0\} \prod_{j=1}^{q} \exp\left\{\beta_j x_j\right\}.$$

That is, all feature components $x_j$ enter the regression function in an exponential form. In general insurance, one may have specific variables for which it is explicitly known that they should enter the regression function as a power function. Having a raw feature $\widetilde{x}_l$ we can pre-process it as $\widetilde{x}_l \mapsto x_l = \log(\widetilde{x}_l)$. This implies

$$\mu(\boldsymbol{x}) = \exp\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = \exp\{\beta_0\}\, \widetilde{x}_l^{\beta_l} \prod_{j=1,\ j \neq l}^{q} \exp\left\{\beta_j x_j\right\},$$

which gives a power term of order $\beta_l$. In this case, the GLM estimates the power parameter that should be used for $\widetilde{x}_l$. If the power parameter is known, then one can even include this component as an offset; offsets are discussed in Sect. 5.2.3, below.
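The effect of this log pre-processing can be checked numerically: with $x_l = \log\widetilde{x}_l$, the multiplicative GLM factor $\exp\{\beta_l x_l\}$ is exactly the power $\widetilde{x}_l^{\beta_l}$. A tiny Python check with illustrative values:

```python
import math

x_tilde = 7.5   # strictly positive raw feature (illustrative value)
beta_l = 0.8    # coefficient the GLM would estimate for x_l = log(x_tilde)

# the GLM factor exp{beta_l * x_l} ...
factor = math.exp(beta_l * math.log(x_tilde))

# ... is exactly the power term x_tilde^{beta_l}
assert abs(factor - x_tilde ** beta_l) < 1e-12
```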

#### **Interactions**

Naturally, GLMs only allow for an additive structure in the linear predictor. Similarly to continuous feature components, such an additive structure may not always be suitable, and one wants to model more complex interaction terms. Such interactions need to be added manually by the modeler. For instance, if we have two raw feature components $\widetilde{x}_l$ and $\widetilde{x}_k$, we may want to consider a functional form

$$(\widetilde{x}_l, \widetilde{x}_k) \mapsto \ \beta_1 \widetilde{x}_l + \beta_2 \widetilde{x}_k + \beta_3 \widetilde{x}_l \widetilde{x}_k + \beta_4 \widetilde{x}_l^2 \widetilde{x}_k,$$

with regression parameter $(\beta_1,\ldots,\beta_4)^\top$.

More generally, this manual feature engineering of adding interactions and of specifying functional forms (5.24) can be understood as a new representation of raw features. Representation learning in relation to deep learning is going to be discussed in Sect. 7.1, and this discussion is also related to Mercer's kernels.
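Such manual interaction columns can be generated mechanically; a Python sketch (the book's listings are in R, and the function name here is illustrative):

```python
import numpy as np

def interaction_features(x_l, x_k):
    """Design-matrix columns for the interaction form
    beta1*x_l + beta2*x_k + beta3*x_l*x_k + beta4*x_l^2*x_k."""
    x_l = np.asarray(x_l, dtype=float)
    x_k = np.asarray(x_k, dtype=float)
    return np.column_stack([x_l, x_k, x_l * x_k, x_l ** 2 * x_k])

# two policies with raw features (x_l, x_k) = (2, 5) and (3, 7)
Z = interaction_features([2.0, 3.0], [5.0, 7.0])
```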

## *5.2.3 Offsets*

In many heterogeneous portfolio problems with observations $\boldsymbol{Y} = (Y_1,\ldots,Y_n)^\top$, there are known prior differences between the individual risks $Y_i$; for instance, the time exposure varies between the different policies $i$. Such known prior differences can be integrated into the predictors, and this integration typically does not involve any additional model parameters. A simple way is to use an *offset* (constant) in the linear predictor of a GLM. Assume that each observation $Y_i$ is equipped with a feature $\boldsymbol{x}_i \in \mathcal{X}$ and a known offset $o_i \in \mathbb{R}$ such that the linear predictor $\eta_i$ takes the form

$$(\boldsymbol{x}_i, o_i) \mapsto \ g(\mu_i) = \eta_i = \eta(\boldsymbol{x}_i, o_i) = o_i + \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle,\tag{5.25}$$

for all $1 \le i \le n$. An offset $o_i$ does not change anything from a structural viewpoint; in fact, it could be integrated into the feature $\boldsymbol{x}_i$ with a regression parameter that is identically equal to 1.

Offsets are frequently used in Poisson models with the (canonical) log-link choice to model multiplicative time exposures in claim frequency modeling. Under the log-link choice we receive from (5.25) the following mean function

$$(\boldsymbol{x}_i, o_i) \mapsto \mu(\boldsymbol{x}_i, o_i) = \exp\{\eta(\boldsymbol{x}_i, o_i)\} = \exp\{o_i + \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle\} = \exp\{o_i\} \exp\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle.$$

In this version, the offset $o_i$ provides us with an exposure $\exp\{o_i\}$ that acts multiplicatively on the regression function. If $w_i = \exp\{o_i\}$ measures time, then $w_i$ is a so-called pro-rata temporis (proportional in time) exposure.
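The multiplicative role of the offset can be verified directly: under the log-link, $\exp\{o_i + \langle\boldsymbol{\beta}, \boldsymbol{x}_i\rangle\} = v_i \exp\langle\boldsymbol{\beta}, \boldsymbol{x}_i\rangle$ for $o_i = \log v_i$. A small numpy check with illustrative parameter values:

```python
import numpy as np

beta = np.array([-2.0, 0.3])   # illustrative regression parameter
x = np.array([1.0, 0.5])       # feature vector with intercept component
v = 0.25                       # pro-rata temporis exposure (3 months)
o = np.log(v)                  # log-exposure as offset in the predictor

mu_offset = np.exp(o + beta @ x)   # mean with offset in the predictor
mu_full_year = np.exp(beta @ x)    # mean of a full-year policy

# the offset acts multiplicatively on the regression function
assert abs(mu_offset - v * mu_full_year) < 1e-12
```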

*Remark 5.14 (Boosting)* A popular machine learning technique in statistical modeling is boosting. Boosting tries to step-wise adaptively improve a regression model. Offsets (5.25) are a simple way of constructing boosted models. Assume we have constructed a predictive model using any statistical model, and denote the resulting estimated means of $Y_i$ by $\widehat{\mu}_i^{(0)}$. The idea of boosting is that we select another statistical model and try to see whether this second model can still find systematic structure in the data which has not been found by the first model. In view of (5.25), we include the first model into the offset and build a second model around this offset, that is, we may explore a GLM

$$
\widehat{\mu}_i^{(1)} = g^{-1}\left(g(\widehat{\mu}_i^{(0)}) + \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle\right).
$$

If the first model is perfect we come up with a regression parameter $\boldsymbol{\beta} = 0$, otherwise the linear predictor $\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$ of the second model starts to compensate for weaknesses in $\widehat{\mu}_i^{(0)}$. Of course, this boosting procedure can be iterated, and one should stop boosting before the resulting model starts to over-fit to the data. Typically, this approach is applied to regression trees instead of GLMs, see Ferrario–Hämmerli [125], Section 7.4 in Wüthrich–Buser [392], Lee–Lin [241] and Denuit et al. [100].
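The offset-based boosting step can be sketched on synthetic data: fit an intercept-only Poisson model, push its fitted log-means into the offset, and let a one-parameter second-stage Poisson GLM (fitted by Newton's method) search for remaining structure. Everything below is an illustrative Python toy example, not the book's R code:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.poisson(np.exp(0.1 + 0.4 * x))   # synthetic claim counts

# model (0): intercept-only Poisson GLM, the fitted mean is the sample mean
mu0 = np.full_like(x, y.mean())

# boosting step: second Poisson GLM around the offset o_i = log(mu0_i),
# i.e., mu1_i = exp(log(mu0_i) + beta * x_i); fit beta by Newton's method
beta = 0.0
for _ in range(25):
    mu1 = mu0 * np.exp(beta * x)
    score = np.sum(x * (y - mu1))    # first derivative of the log-likelihood
    info = np.sum(x ** 2 * mu1)      # observed information
    beta += score / info

# beta far from 0 indicates structure the first model has missed
```

Since the data was simulated with a genuine slope, the second stage recovers a clearly non-zero `beta`, i.e., the intercept-only model (0) was not perfect.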

## *5.2.4 Lab: Poisson GLM for Car Insurance Frequencies*

We present a first GLM example. This example is based on French motor third party liability (MTPL) insurance claim counts data. The data is described in detail in Sect. 13.1; an excerpt of the available MTPL data is given in Listing 13.2. For the moment we only consider claim frequency modeling. We use the following data: $N_i$ describes the number of claims, $v_i \in (0,1]$ describes the duration of the insurance policy, and $\widetilde{\boldsymbol{x}}_i$ describes the available raw feature information of insurance policy $i$, see Listing 13.2.

We are going to model the claim counts $N_i$ with a Poisson GLM using the canonical link function of the Poisson model. In the Poisson approach there are two different ways to account for the duration of the insurance policy. Either we model $Y_i = N_i/v_i$ with the Poisson model of the EDF, see Sect. 2.2.2 and Remarks 2.13 (reproductive form), or we directly model $N_i$ with the Poisson distribution from the EF and treat the log-duration as an offset variable $o_i = \log v_i$. In the first approach we have for the log-link choice $g(\cdot) = h(\cdot) = \log(\cdot)$ and dispersion $\varphi = 1$

$$Y_i = N_i/v_i \sim f(y_i; \theta_i, v_i) = \exp\left\{\frac{y_i \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle - e^{\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle}}{1/v_i} + a(y_i; v_i)\right\},\tag{5.26}$$

where $\boldsymbol{x}_i \in \mathcal{X}$ is the suitably pre-processed feature information of insurance policy $i$, and with canonical parameter $\theta_i = \eta(\boldsymbol{x}_i) = \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$. In the second approach we include the log-duration as an offset into the regression function and model $N_i$ with the Poisson distribution from the EF. Using notation (2.2) this gives us

$$N_i \sim f(n_i; \theta_i) = \exp\left\{n_i \left(\log v_i + \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle\right) - e^{\log v_i + \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle} + a(n_i)\right\}\tag{5.27}$$

$$= \exp\left\{\frac{\frac{n_i}{v_i} \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle - e^{\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle}}{1/v_i} + a(n_i) + n_i \log v_i\right\},$$

with canonical parameter $\theta_i = \eta(\boldsymbol{x}_i, o_i) = o_i + \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle = \log v_i + \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$ for observation $n_i = v_i y_i$. That is, we receive the same model in both cases (5.26) and (5.27) under the canonical log-link choice for the Poisson GLM.

Finally, we make the assumption that all observations $N_i$ are independent. There remains the pre-processing of the raw features $\widetilde{\boldsymbol{x}}_i$ to features $\boldsymbol{x}_i$ so that they can be used in a sensible way in the linear predictors $\eta_i = \eta(\boldsymbol{x}_i, o_i) = o_i + \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$.

#### **Feature Engineering**

Categorical and Binary Variables: Dummy Coding

For categorical and binary variables we use dummy coding as described in Sect. 5.2.2. We have two categorical variables VehBrand and Region, as well as a binary variable VehGas, see Listing 13.2. We choose the first level as reference level, and the remaining levels are characterized by $(K-1)$-dimensional embeddings (5.21). This provides us with $K-1 = 10$ parameters for VehBrand, $K-1 = 21$ parameters for Region and $K-1 = 1$ parameter for VehGas.
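Dummy coding itself is mechanical; a minimal Python sketch of the $(K-1)$-dimensional embedding (5.21), with a chosen reference level (function name and level names are illustrative):

```python
import numpy as np

def dummy_code(levels, reference):
    """(K-1)-dimensional dummy coding: the reference level maps to the
    zero vector, any other level to the corresponding unit vector."""
    categories = [c for c in sorted(set(levels)) if c != reference]
    col = {c: j for j, c in enumerate(categories)}
    Z = np.zeros((len(levels), len(categories)))
    for i, c in enumerate(levels):
        if c != reference:
            Z[i, col[c]] = 1.0
    return Z

# binary variable VehGas: K = 2 levels give a single dummy column
Z = dummy_code(["Diesel", "Regular", "Diesel", "Regular"], reference="Diesel")
```

Note that all non-reference unit vectors have the same mutual Euclidean distance in this embedding, reflecting that nominal levels carry no topology.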

Figure 5.3 shows the empirical marginal frequencies $\widehat{\lambda} = \sum_i N_i / \sum_i v_i$ on all levels of the categorical feature components VehBrand, Region and VehGas. Moreover, the blue areas (in the colored version) give confidence bounds of $\pm 2\sqrt{\widehat{\lambda}/\sum_i v_i}$ (under a Poisson assumption), see Example 3.22. The narrower these confidence bounds, the bigger the volumes $\sum_i v_i$ behind these empirical marginal estimates.

**Fig. 5.3** Empirical marginal frequencies on each level of the categorical variables (lhs) VehBrand, (middle) Region, and (rhs) VehGas

#### Continuous Variables

We consider feature engineering of the continuous variables Area, VehPower, VehAge, DrivAge, BonusMalus and log-Density (Density on the log scale); note that we map the Area codes $(A,\ldots,F) \mapsto (1,\ldots,6)$. Some of these variables show neither monotonicity nor log-linearity in the empirical marginal frequency plots, see Fig. 5.4.

This non-monotonicity and non-log-linearity suggest, in a first step, to build homogeneous classes for these feature components and to use dummy coding for the resulting classes. We make the following choices here (motivated by the marginal graphs of Fig. 5.4):


This encoding is slightly different from Noll et al. [287] because of different data cleaning. The discretization has been chosen quite ad hoc by just looking at the empirical plots; as illustrated in Section 6.1.6 of Wüthrich–Buser [392], regression trees may provide an algorithmic way of choosing homogeneous classes of sufficient volume. This provides us with a feature space (the initial component stands for the intercept $x_{i,0} = 1$ and the order of the terms is the same as in Listing 13.2)

$$\mathcal{X} \subset \{1\} \times \mathbb{R} \times \{0,1\}^{5} \times \{0,1\}^{2} \times \{0,1\}^{6} \times \mathbb{R} \times \{0,1\}^{10} \times \{0,1\} \times \mathbb{R} \times \{0,1\}^{21},$$

of dimension $q+1 = 1+1+5+2+6+1+10+1+1+21 = 49$. The R code [307] for this pre-processing of continuous variables is shown in Listing 5.1; categorical variables do not need any special treatment because variables of factor type are considered internally in R by dummy coding. We call this model Poisson GLM1.
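As a stand-in for the pre-processing of Listing 5.1 (the actual bin edges used in the book differ), discretization plus dummy coding can be sketched as follows; with seven driver's age classes we obtain the six dummy columns $\{0,1\}^6$ counted above:

```python
import numpy as np

def class_dummies(values, bin_edges):
    """Discretize a continuous feature into len(bin_edges)+1 classes and
    dummy-code them against the lowest class as reference level."""
    cls = np.digitize(values, bin_edges)        # class index per policy
    Z = np.zeros((len(values), len(bin_edges)))
    for i, c in enumerate(cls):
        if c > 0:                               # class 0 is the reference
            Z[i, c - 1] = 1.0
    return Z

# illustrative edges giving 7 age classes, hence 6 dummy columns
Z = class_dummies(np.array([19, 25, 40, 70]),
                  bin_edges=[21, 26, 31, 41, 51, 71])
```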

#### **Choice of Learning and Test Samples**

To measure predictive performance we follow the generalization approach as proposed in Chap. 4. This requires that we partition our entire data into learning sample *L* and test sample *T* , see Fig. 4.1. Model selection and model fitting will be done on the learning sample *L*, only, and the test sample *T* is used to analyze the generalization of the fitted models to unseen data. We partition the data at random (non-stratified) in a ratio of 9 : 1, and we are going to hold on to the same partitioning throughout this monograph whenever we study this example. The R code used is given in Listing 5.2.

**Listing 5.1** Pre-processing of features for model Poisson GLM1 in R


Table 5.2 shows the summary of the chosen partition into learning and test samples

$$\mathcal{L} = \left\{(Y_i = N_i/v_i, \boldsymbol{x}_i, v_i) \,:\, i = 1,\ldots,n = 610'206\right\},$$

and

$$\mathcal{T} = \left\{(Y_t^\dagger = N_t^\dagger/v_t^\dagger, \boldsymbol{x}_t^\dagger, v_t^\dagger) \,:\, t = 1,\ldots,T = 67'801\right\}.$$

In contrast to Sect. 4.2, we also include the feature and exposure information in $\mathcal{L}$ and $\mathcal{T}$.

**Listing 5.2** Partition of the data to learning sample *L* and test sample *T*

```
1 RNGversion("3.5.0") # we use R version 3.5.0 for this partition
2 set.seed(500)
3 ll <- sample(c(1:nrow(dat)), round(0.9*nrow(dat)), replace = FALSE)
4 learn <- dat[ll,]
5 test <- dat[-ll,]
```
**Table 5.2** Choice of learning data set *L* and test data set *T* ; the empirical frequency on both data sets is similar (last column), and the split of the policies w.r.t. the numbers of claims is also rather similar


#### **Maximum-Likelihood Estimation and Results**

The remaining step is to perform MLE to estimate the regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$. This can be done either by maximizing the Poisson log-likelihood function or by minimizing the Poisson deviance loss. In view of (4.9) and Example 4.27, the Poisson deviance loss on the learning data $\mathcal{L}$ is given by

$$\boldsymbol{\beta} \mapsto \mathfrak{D}(\mathcal{L}, \boldsymbol{\beta}) = \frac{2}{n} \sum_{i=1}^{n} v_i \left(\mu(\boldsymbol{x}_i) - Y_i - Y_i \log\left(\frac{\mu(\boldsymbol{x}_i)}{Y_i}\right)\right),\tag{5.28}$$

where the terms under the summation are set equal to $v_i \mu(\boldsymbol{x}_i)$ for $Y_i = 0$, see (4.8), and we have the GLM regression function

$$\boldsymbol{x} \mapsto \mu(\boldsymbol{x}) = \mu_{\boldsymbol{\beta}}(\boldsymbol{x}) = \exp\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle.$$

That is, we work under the canonical link with the canonical parameter being equal to the linear predictor. The MLE of $\boldsymbol{\beta}$ is found by minimizing (5.28). This is done with Fisher's scoring method. In order to receive a non-degenerate solution we need to ensure that we have sufficiently many claims $Y_i > 0$, otherwise it might happen that the MLE provides a (degenerate) solution at the boundary of the effective domain. We denote the MLE by $\widehat{\boldsymbol{\beta}}_{\mathcal{L}}^{\rm MLE} = \widehat{\boldsymbol{\beta}}^{\rm MLE}$, because it has been estimated on the learning data $\mathcal{L}$, only. This gives us the estimated regression function

$$\boldsymbol{x} \mapsto \widehat{\mu}(\boldsymbol{x}) = \mu_{\widehat{\boldsymbol{\beta}}_{\mathcal{L}}^{\rm MLE}}(\boldsymbol{x}) = \exp\langle \widehat{\boldsymbol{\beta}}_{\mathcal{L}}^{\rm MLE}, \boldsymbol{x} \rangle.$$

We emphasize that we only use the learning data *L* for this model fitting. In view of Definition 4.24 we receive in-sample and out-of-sample Poisson deviance losses

$$\begin{split} \mathfrak{D}(\mathcal{L}, \widehat{\boldsymbol{\beta}}_{\mathcal{L}}^{\rm MLE}) &= \frac{2}{n} \sum_{i=1}^{n} v_i \left(\widehat{\mu}(\boldsymbol{x}_i) - Y_i - Y_i \log\left(\frac{\widehat{\mu}(\boldsymbol{x}_i)}{Y_i}\right)\right) \geq 0, \\ \mathfrak{D}(\mathcal{T}, \widehat{\boldsymbol{\beta}}_{\mathcal{L}}^{\rm MLE}) &= \frac{2}{T} \sum_{t=1}^{T} v_t^\dagger \left(\widehat{\mu}(\boldsymbol{x}_t^\dagger) - Y_t^\dagger - Y_t^\dagger \log\left(\frac{\widehat{\mu}(\boldsymbol{x}_t^\dagger)}{Y_t^\dagger}\right)\right) \geq 0. \end{split}$$
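The Poisson deviance loss with the $Y_i = 0$ convention can be coded directly; a Python sketch (illustrative helper, the book computes these losses in R):

```python
import numpy as np

def poisson_deviance_loss(y, mu, v):
    """Poisson deviance loss (5.28); summands with Y_i = 0 reduce
    to v_i * mu(x_i), see (4.8)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    v = np.asarray(v, dtype=float)
    y_safe = np.where(y > 0, y, 1.0)          # avoid log(0) artifacts
    term = np.where(y > 0,
                    v * (mu - y - y * np.log(mu / y_safe)),
                    v * mu)
    return 2.0 * term.mean()

# a perfect fit on positive observations has zero deviance loss
assert poisson_deviance_loss([1.0, 2.0], [1.0, 2.0], [1.0, 1.0]) == 0.0
```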

We implement this GLM on the data of Listing 5.1 (including the categorical features) in R using the function glm [307]; a short overview of the results is presented in Listing 5.3. This overview presents the regression model implemented, an excerpt of the parameter estimates $\widehat{\boldsymbol{\beta}}_{\mathcal{L}}^{\rm MLE}$, and standard errors which are received from the square-rooted diagonal entries of the inverse of the estimated Fisher's information matrix $\mathcal{I}_n(\widehat{\boldsymbol{\beta}}_{\mathcal{L}}^{\rm MLE})$, see (5.17); the remaining columns will be described in Sect. 5.3.2 on the Wald test (5.33). The bottom line of the output says that Fisher's scoring algorithm has converged in 6 iterations; it gives the in-sample deviance loss $n\mathfrak{D}(\mathcal{L}, \widehat{\boldsymbol{\beta}}_{\mathcal{L}}^{\rm MLE})$ called Residual deviance (not being scaled by the number of

**Listing 5.3** Results in model Poisson GLM1 using the R command glm

```
1 Call:
2 glm(formula = ClaimNb ~ VehPowerGLM + VehAgeGLM + DrivAgeGLM +
3 BonusMalusGLM + VehBrand + VehGas + DensityGLM + Region +
4 AreaGLM, family = poisson(), data = learn, offset = log(Exposure))
5
6 Deviance Residuals:
7 Min 1Q Median 3Q Max
8 -1.4728 -0.3256 -0.2456 -0.1383 7.7971
9
10 Coefficients:
11                Estimate Std. Error z value Pr(>|z|)
12 (Intercept)  -4.8175439  0.0579296 -83.162  < 2e-16 ***
13 VehPowerGLM5  0.0604293  0.0229841   2.629 0.008559 **
14 VehPowerGLM6  0.0868252  0.0225509   3.850 0.000118 ***
15 ...
16 ...
17 RegionR93     0.1388160  0.0294901   4.707 2.51e-06 ***
18 RegionR94     0.1918538  0.0938250   2.045 0.040874 *
19 AreaGLM       0.0407973  0.0200818   2.032 0.042199 *
20 ---
21 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
22
23 (Dispersion parameter for poisson family taken to be 1)
24
25 Null deviance: 153852 on 610205 degrees of freedom
26 Residual deviance: 147069 on 610157 degrees of freedom
27 AIC: 192818
28
29 Number of Fisher Scoring iterations: 6
```
**Table 5.3** Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses, tenfold cross-validation losses with empirical standard deviation in brackets, see also (4.36), (units are in 10−2) and the in-sample average frequency of the null model (Poisson intercept model, see Example 4.27) and of model Poisson GLM1


observations), as well as Akaike's Information Criterion (AIC), see Sect. 4.2.3 for AIC. Note that we have implemented the Poisson version (5.27) with the exposures entering the offset, see lines 2–4 of Listing 5.3; this is important for understanding that AIC is calculated on the (unscaled) claim counts $N_i$.

Table 5.3 summarizes the results of model Poisson GLM1 and compares the figures to the null model (only having an intercept $\beta_0$); the null model has already been introduced in Example 4.27. We present the run time needed to fit the model,<sup>3</sup> the number of regression parameters $q+1$ in $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$, AIC, in-sample and out-of-sample deviance losses, as well as tenfold cross-validation losses on the

<sup>3</sup> All run times are measured on a personal laptop Intel(R) Core(TM) i7-8550U CPU @ 1.80 GHz 1.99 GHz with 16 GB RAM, and they only correspond to fitting the model (or the corresponding step) once, i.e., they do not account for multiple runs, for instance, for *K*-fold cross-validation.

learning data $\mathcal{L}$. For tenfold cross-validation we always use the same (non-stratified) partition of $\mathcal{L}$ (in all examples in this monograph), and in brackets we show the empirical standard deviation received by (4.36). Tenfold cross-validation would not be necessary in this case because we have test data $\mathcal{T}$ on which we can evaluate the out-of-sample deviance GL. We present both figures to back-test whether tenfold cross-validation works properly in our example. We observe that the out-of-sample deviance losses $\mathfrak{D}(\mathcal{T}, \widehat{\boldsymbol{\beta}}_{\mathcal{L}}^{\rm MLE})$ are within one empirical standard deviation of the tenfold cross-validation losses $\widehat{\mathfrak{D}}^{\rm CV}$, which supports this methodology of model comparison.

From Table 5.3 we conclude that we should prefer model Poisson GLM1 over the null model; this decision is supported by a smaller AIC, a smaller out-of-sample deviance loss $\mathfrak{D}(\mathcal{T}, \widehat{\boldsymbol{\beta}}_{\mathcal{L}}^{\rm MLE})$ as well as a smaller cross-validation loss $\widehat{\mathfrak{D}}^{\rm CV}$. The last column of Table 5.3 confirms that the estimated model meets the balance property (we work with the canonical link here). Note that this balance property should be fulfilled for two reasons. Firstly, we would like to have the overall portfolio price on the right level, and secondly, deviance losses should only be compared on the same overall frequency, see Example 4.10.

Before we continue to introduce more models to challenge model Poisson GLM1, we are going to discuss statistical tools for model evaluation. Of course, we would like to know whether model Poisson GLM1 is a good model for this data or whether it is just the better model of two bad options.

*Remark 5.15 (Prior and Posterior Information)* Pricing literature distinguishes between prior feature information and posterior feature information, see Verschuren [372]. Prior feature information is available at the inception of the (new) insurance contract before having any claims history. This includes, for instance, age of driver, vehicle brand, etc. For policy renewals, past claims history is available, and prices of policy renewals can also be based on such posterior information. Past claims history has led to the development of so-called bonus-malus systems (BMS) which often are in the form of multiplicative factors to the base premium to reward and punish good and bad past experience, respectively. One stream of literature studies optimal designs of BMS, we refer to Loimaranta [255], De Pril [91], Lemaire [245], Denuit et al. [102], Brouhns et al. [57], Pinquet [304], Pinquet et al. [305], Tzougas et al. [360] or Ágoston–Gyetvai [4]. Another stream of literature studies how one can optimally extract predictive information from an existing BMS, see Boucher–Inoussa [46], Boucher–Pigeon [47] and Verschuren [372].

The latter is basically what we also do in the above example: note that we include the variable BonusMalus into the feature information and, thus, we use past claims information to predict future claims. For new policies, the bonus-malus level is at 100%, and our information does not allow us to clearly distinguish between new policies and policy renewals for drivers whose posterior information is reflected by a bonus-malus level of 100%. Since young drivers are more likely new customers, we expect interactions between the driver's age variable and the bonus-malus level; this intuition is supported by Fig. 13.12 (lhs). In order to improve our model, we would require more detailed information about past claims history. Note that we do not strictly distinguish between prior and posterior information here. If we go over to a time-series consideration, where more and more claims experience becomes available for an individual driver, we should clearly distinguish the different sets of information, because otherwise it may happen that in prior and posterior pricing factors we correct twice for the same factor; an interesting paper is Corradin et al. [82].

We also mention that a new source of posterior information is emerging through the collection of telematics car driving data. Telematics car driving data leads to a completely new way of posterior information rate making (experience rating), we refer to Ayuso et al. [17–19], Boucher et al. [42], Lemaire et al. [246] and Denuit et al. [98]. We mention the papers of Gao et al. [152, 154] and Meng et al. [271] who directly extract posterior feature information from telematics car driving data in order to improve rate making. This approach combines a Poisson GLM with a network extractor for the telematics car driving data.

## **5.3 Model Validation**

One of the purposes of Chap. 4 has been to describe measures to analyze how well a fitted model generalizes to unseen data. A proper generalization analysis requires learning data $\mathcal{L}$ for in-sample model fitting and a test sample $\mathcal{T}$ for an out-of-sample generalization analysis. In many cases, one is not in the comfortable situation of having a test sample. In such situations one can use AIC, which tries to correct the in-sample figure for model complexity, or, alternatively, $K$-fold cross-validation as used in Table 5.3.

The purpose of this section is to introduce diagnostic tools for fitted models; these are often based on unit deviances $\mathfrak{d}(Y_i, \mu_i)$, which play the role of squared residuals in classical linear regression. Moreover, we discuss parameter and model selection, for instance, by step-wise backward elimination or forward selection using the analysis of variance (ANOVA) or the likelihood ratio test (LRT).

## *5.3.1 Residuals and Dispersion*

Within the EDF we distinguish two different types of residuals. The first type of residuals is based on the unit deviances $\mathfrak{d}(Y_i, \mu_i)$ studied in (4.7). The *deviance residuals* are given by

$$r_i^{D} = \mathrm{sign}(Y_i - \mu_i) \sqrt{\frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i)}.$$

Secondly, *Pearson's residuals* are given by, see also (4.12),

$$r\_i^{\mathbb{P}} = \sqrt{\frac{v\_i}{\varphi}} \frac{Y\_i - \mu\_i}{\sqrt{V(\mu\_i)}}.$$

In the Gaussian case the two residuals coincide. This indicates that Pearson's residuals are most appropriate in the Gaussian case because they respect the distributional properties in that case. For other distributions, Pearson's residuals can be markedly skewed, as stated in Section 2.4.2 of McCullagh–Nelder [265], and therefore may fail to have properties similar to Gaussian residuals. Another issue occurs in Pearson's residuals when the denominator involves an estimated standard deviation $\sqrt{V(\widehat{\mu}_i)}$, for instance, if we work in a small frequency Poisson problem. Estimation uncertainty in small denominators of Pearson's residuals may substantially distort the estimated residuals. For this reason, we typically work with (the more robust) deviance residuals; this is related to the discussion in Chap. 4 on MSEPs versus expected deviance GLs, see Remarks 4.6.

The squared residuals provide unit deviance and weighted square loss, respectively,

$$(r_i^{D})^2 = \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i) \qquad \text{and} \qquad (r_i^{P})^2 = \frac{v_i}{\varphi} \frac{(Y_i - \mu_i)^2}{V(\mu_i)},$$

the latter corresponds to Pearson's *χ*2-statistic, see (4.12).
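Both residual types are straightforward to compute; a Python sketch for the Poisson case with $V(\mu) = \mu$ (an illustrative helper, not from the book):

```python
import numpy as np

def poisson_residuals(y, mu, v=1.0, phi=1.0):
    """Deviance and Pearson residuals for the Poisson case, where the
    unit deviance is d(y, mu) = 2(mu - y - y log(mu/y))."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    y_safe = np.where(y > 0, y, 1.0)
    unit_dev = 2.0 * np.where(y > 0, mu - y - y * np.log(mu / y_safe), mu)
    r_dev = np.sign(y - mu) * np.sqrt(v / phi * unit_dev)
    r_pearson = np.sqrt(v / phi) * (y - mu) / np.sqrt(mu)
    return r_dev, r_pearson

r_dev, r_pearson = poisson_residuals([2.0, 1.0], [2.0, 2.0])
```

For $Y = \mu$ both residuals vanish; away from the mean they differ, which reflects the skewness discussion above.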

*Example 5.16 (Residuals in the Poisson Case)* In the Poisson case, Pearson's $\chi^2$-statistic is for $v_i = \varphi = 1$ given by

$$(r_i^{P})^2 = \frac{(Y_i - \mu_i)^2}{\mu_i},$$

because we have variance function $V(\mu) = \mu$. A second order Taylor expansion around $Y_i$ on the scale $\mu_i^{1/3}$ (for $\mu_i$) provides the following approximation to the unit deviances in the Poisson case, see formula (6.4) and Figure 6.2 in McCullagh–Nelder [265],

$$\mathfrak{d}(Y_i, \mu_i) \approx \ 9\, Y_i^{1/3}\left(Y_i^{1/3} - \mu_i^{1/3}\right)^2.\tag{5.29}$$

This emphasizes the different behaviors around the observation $Y_i$ of the two types of residuals in the Poisson case. The scale $\mu_i^{1/3}$ has been motivated in McCullagh–

**Fig. 5.5** Log-likelihoods $\ell_Y(\mu)$ in $Y = 1$ as a function of $\mu$ plotted against (lhs) $\mu^{1/3}$ in the Poisson case, (middle) $\mu^{-1/3}$ in the gamma case with shape parameter $\alpha = 1$, and (rhs) $\mu^{-1}$ in the inverse Gaussian case with $\alpha = 1$

Nelder [265] by providing a symmetric behavior around the mode in $Y_i = 1$ of the resulting log-likelihood function, see Fig. 5.5 (lhs).
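The quality of this second order Taylor approximation, $\mathfrak{d}(y, \mu) \approx 9\, y^{1/3}(y^{1/3} - \mu^{1/3})^2$, can be checked numerically for means close to the observation; a small Python verification with illustrative values:

```python
import numpy as np

y, mu = 1.0, 1.2   # observation and a nearby mean (illustrative values)

# exact Poisson unit deviance d(y, mu) = 2(mu - y - y log(mu/y))
d_exact = 2.0 * (mu - y - y * np.log(mu / y))

# second order Taylor approximation on the mu^(1/3) scale
d_approx = 9.0 * y ** (1.0 / 3.0) * (y ** (1.0 / 3.0) - mu ** (1.0 / 3.0)) ** 2

# the two agree closely for mu near y
assert abs(d_exact - d_approx) < 1e-3
```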

The explicit calculation of the residuals requires knowledge of the dispersion parameter $\varphi > 0$. In the Poisson Example 5.16 this dispersion parameter has been set equal to 1 because the Poisson model allows neither for under- nor for over-dispersion. Typically, this is not the case for other models, and this requires determination of the dispersion parameter if we want to simulate from these other models. So far, this dispersion parameter has been treated as a nuisance parameter and, in fact, it canceled in the MLE (because it was assumed to be constant), see Proposition 5.1.

If we need to estimate the dispersion parameter, we can either do this within MLE, see Remarks 5.2, or we can use Pearson's or the deviance estimates, respectively,

$$\widehat{\varphi}^{P} = \frac{1}{n - (q+1)} \sum_{i=1}^{n} \frac{(Y_i - \widehat{\mu}_i)^2}{V(\widehat{\mu}_i)/v_i} \qquad \text{and} \qquad \widehat{\varphi}^{D} = \frac{1}{n - (q+1)} \sum_{i=1}^{n} v_i\, \mathfrak{d}\left(Y_i, \widehat{\mu}_i\right),\tag{5.30}$$

where $\widehat{\mu}_i = \widehat{\mu}(\boldsymbol{x}_i)$ are the MLE estimated means involving the $q+1$ estimated parameters $\widehat{\boldsymbol{\beta}}^{\rm MLE} \in \mathbb{R}^{q+1}$. We briefly motivate these choices. Firstly, Pearson's estimate $\widehat{\varphi}^{P}$ is consistent for $\varphi$. Note that in the Gaussian case this is just the standard estimate for the variance parameter. Justification of the deviance dispersion estimate is more challenging. Consider the unscaled deviance with $\widehat{\boldsymbol{\mu}}_n = (\widehat{\mu}_1,\ldots,\widehat{\mu}_n)^\top$, see (4.9),

$$n\, \mathfrak{D}(\boldsymbol{Y}_n, \widehat{\boldsymbol{\mu}}_n) = \sum_{i=1}^{n} v_i\, \mathfrak{d}\left(Y_i, \widehat{\mu}_i\right).$$


**Fig. 5.6** Expected unit deviance $v\,\mathbb{E}_\mu[\mathfrak{d}(Y, \mu)]$ in the Poisson case as a function of $\mathbb{E}[N] = \mathbb{E}[vY] = v\mu$; the two plots only differ in the scale on the $x$-axis

This statistic is, under *certain* assumptions, asymptotically $\varphi \chi^2_{n-(q+1)}$-distributed, where $\chi^2_{n-(q+1)}$ denotes a $\chi^2$-distribution with $n-(q+1)$ degrees of freedom. Thus, this approximation gives us an expected value of $\varphi(n-(q+1))$. This exactly justifies the deviance dispersion estimate (5.30) in these cases. However, as stated in the last paragraph of Section 2.3 of McCullagh–Nelder [265], often a $\chi^2$-approximation is not suitable, even as $n \to \infty$. We give an example.
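Pearson's dispersion estimate from (5.30) is simple to implement; a Python sketch with the Poisson variance function $V(\mu) = \mu$ (swap in the appropriate $V$ for other EDF members; names and toy numbers are illustrative):

```python
import numpy as np

def pearson_dispersion(y, mu, v, n_params):
    """Pearson's dispersion estimate (5.30), here with the Poisson
    variance function V(mu) = mu."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.sum((y - mu) ** 2 / (mu / v)) / (len(y) - n_params)

# toy check: two unit-exposure observations, one fitted parameter
phi_hat = pearson_dispersion([1.0, 3.0], [2.0, 2.0], [1.0, 1.0], n_params=1)
```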

*Example 5.17 (Poisson Unit Deviances)* The deviance statistic in the Poisson model with means $\boldsymbol{\mu}_n = (\mu_1,\ldots,\mu_n)^\top$ is given by

$$\mathfrak{D}(\boldsymbol{Y}_n, \boldsymbol{\mu}_n) = \frac{1}{n} \sum_{i=1}^{n} v_i\, \mathfrak{d}\left(Y_i, \mu_i\right) = \frac{1}{n} \sum_{i=1}^{n} 2 v_i \left(\mu_i - Y_i - Y_i \log\left(\frac{\mu_i}{Y_i}\right)\right),$$

note that in the Poisson model we have (by definition) $\varphi = 1$. We evaluate the expected value of this deviance statistic. It is given by

$$\mathbb{E}_{\boldsymbol{\mu}_n}\left[\mathfrak{D}(\boldsymbol{Y}_n, \boldsymbol{\mu}_n)\right] = \frac{1}{n} \sum_{i=1}^{n} 2 v_i\, \mathbb{E}_{\mu_i}\left[\mu_i - Y_i - Y_i \log\left(\frac{\mu_i}{Y_i}\right)\right] = \frac{1}{n} \sum_{i=1}^{n} 2\, \mathbb{E}_{\mu_i}\left[N_i \log\left(\frac{N_i}{v_i \mu_i}\right)\right],$$

with $N_i \stackrel{\text{ind.}}{\sim} \text{Poi}(v_i \mu_i)$.

In Fig. 5.6 we plot the expected unit deviance $v\mu \mapsto v\,\mathbb{E}_\mu[\mathfrak{d}(Y, \mu)]$ in the Poisson model. In our example of Table 5.3, we have $\mathbb{E}_\mu[vY] = v\mu \approx 3.89\%$, which results in an expected unit deviance of $v\,\mathbb{E}_\mu[\mathfrak{d}(Y, \mu)] \approx 25.52 \cdot 10^{-2} < 1$. This is in line with the losses in Table 5.3. Thus, the expected deviance $n\,\mathbb{E}_{\boldsymbol{\mu}_n}[\mathfrak{D}(\boldsymbol{Y}_n, \boldsymbol{\mu}_n)] \approx n/4 < n$; it is substantially smaller than $n$. But this implies that $n\mathfrak{D}(\boldsymbol{Y}_n, \boldsymbol{\mu}_n)$ cannot be asymptotically $\chi^2_{n-(q+1)}$-distributed because the latter has an expected value of $n-(q+1) \approx n$ for $n \to \infty$. In fact, the deviance dispersion estimate is not consistent in this example, and for a consistent estimate one should rely on Pearson's dispersion estimate.

In order to have an asymptotic $\chi^2$-distribution we need large volumes $v$, because then a saddlepoint approximation holds that allows us to approximate the (scaled) unit deviances by $\chi^2$-distributions, see Sect. 5.5.2, below.
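Example 5.17 can be backed by a quick Monte Carlo experiment: at the portfolio frequency of Table 5.3 the expected Poisson unit deviance stays far below 1, so $n\mathfrak{D}$ cannot match the mean $n-(q+1)$ of a $\chi^2_{n-(q+1)}$-distribution. An illustrative Python simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
v, lam = 1.0, 0.0389                    # exposure and frequency as in Table 5.3
N = rng.poisson(v * lam, size=200_000)  # simulated claim counts

# unit deviances v * d(Y_i, lam) with Y_i = N_i / v, using d(0, mu) = 2*mu
y = N / v
y_safe = np.where(y > 0, y, 1.0)
dev = 2.0 * v * np.where(y > 0, lam - y - y * np.log(lam / y_safe), lam)
mean_dev = dev.mean()

# roughly 1/4, far below the value needed for a chi^2 approximation
assert mean_dev < 1.0
```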

## *5.3.2 Hypothesis Testing*

Consider a sub-vector $\boldsymbol{\beta}_r \in \mathbb{R}^r$ of the GLM parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$, for $r < q+1$. We would like to understand whether we can set this sub-vector $\boldsymbol{\beta}_r = 0$ without losing any generalization power. Thus, we investigate whether there is a simpler *nested* GLM that provides a similar prediction accuracy. If this is the case, preference should be given to the simpler model because the bigger model seems over-parametrized (has redundancy, is not parsimonious). This section is based on Section 2.2.2 of Fahrmeir–Tutz [123].

**Geometric Interpretation** We begin by giving a geometric interpretation. We start from the full model being expressed by the design matrix $\mathfrak{X} \in \mathbb{R}^{n \times (q+1)}$. This design matrix together with the link function $g$ generates a $(q+1)$-dimensional manifold $\mathfrak{M} \subset \mathbb{R}^n$ given by, see (5.19) and Fig. 5.2,

$$\mathfrak{M} = \left\{ \boldsymbol{\mu} = g^{-1}(\mathfrak{X}\boldsymbol{\beta}) = \left(g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}_1 \rangle, \dots, g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}_n \rangle\right)^\top \in \mathbb{R}^n \,\middle|\, \boldsymbol{\beta} \in \mathbb{R}^{q+1} \right\} \subset \mathbb{R}^n.$$

The MLE $\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}$ is determined by the point in $\mathfrak{M}$ that minimizes the distance to $\boldsymbol{Y}$, where the distance between $\boldsymbol{Y}$ and $\mathfrak{M}$ is measured component-wise by $\frac{v_i}{\varphi}\,\mathfrak{d}(Y_i, \mu_i)$ with $\boldsymbol{\mu} \in \mathfrak{M}$, i.e., w.r.t. the KL divergence.

Assume now that we want to drop the components $\boldsymbol{\beta}_r$ from $\boldsymbol{\beta}$, i.e., we want to drop the corresponding columns from the design matrix, resulting in a smaller design matrix $\mathfrak{X}_r \in \mathbb{R}^{n \times (q+1-r)}$. This generates a $(q+1-r)$-dimensional *nested* manifold $\mathfrak{M}_r \subset \mathfrak{M}$ described by

$$\mathfrak{M}_r = \left\{ \boldsymbol{\mu} = g^{-1}(\mathfrak{X}_r \boldsymbol{\beta}) \in \mathbb{R}^n \,\middle|\, \boldsymbol{\beta} \in \mathbb{R}^{q+1-r} \right\} \subset \mathfrak{M}.$$

If the distance of *Y* to M*<sup>r</sup>* and M is roughly the same, we should go for the smaller model. In the Gaussian case of Example 5.9 this can be explained by the Pythagorean theorem applied to successive orthogonal projections. In the general unit deviance case, this has to be studied in terms of information geometry considering the KL divergence, see Sect. 2.3.

**Likelihood Ratio Test (LRT)** We consider the testing problem of the null hypothesis *H*<sup>0</sup> against the alternative hypothesis *H*<sup>1</sup>

$$H_0: \boldsymbol{\beta}_r = 0 \qquad \text{against} \qquad H_1: \boldsymbol{\beta}_r \neq 0. \tag{5.31}$$

Denote by $\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}$ the MLE under the full model and by $\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}_{(-r)}$ the MLE under the null hypothesis model. Define the (log-)*likelihood ratio test (LRT) statistic*

$$\Lambda = -2\left(\ell_{\boldsymbol{Y}}\big(\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}_{(-r)}\big) - \ell_{\boldsymbol{Y}}\big(\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}\big)\right) \ge 0.$$

The inequality holds because the null hypothesis model is nested in the full model; hence, the latter needs to have a bigger log-likelihood value in the MLE. If the LRT statistic is large, the null hypothesis should be rejected because the reduced model is not competitive compared to the full model. More mathematically, under similar conditions as for the asymptotic normality results of the MLE of $\boldsymbol{\beta}$ in (5.17), we have that under the null hypothesis $H_0$ the LRT statistic is asymptotically $\chi^2$-distributed with $r$ degrees of freedom. Therefore, we should reject the null hypothesis in favor of the full model if the resulting $p$-value under the $\chi^2_r$-distribution is too small. These results remain true if the unknown dispersion parameter $\varphi$ is replaced by a consistent estimator $\widehat{\varphi}$, e.g., Pearson's dispersion estimate $\widehat{\varphi}^{\mathrm{P}}$ (from the bigger model).

The LRT statistic may not be properly defined in over-dispersed situations where the distributional assumptions are not fully specified, for instance, in an over-dispersed Poisson model. In such situations, one usually divides the log-likelihood (of the Poisson model) by the estimated over-dispersion and then uses the resulting scaled LRT statistic as an approximation to the unspecified model.
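The LRT mechanics can be sketched numerically. The following hedged example (simulated data, our own Fisher-scoring fitter, scipy only for the $\chi^2$ tail) fits a full and a nested Poisson GLM under the null hypothesis and compares $\Lambda$ against a $\chi^2_1$-distribution:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical LRT sketch with r = 1: the covariate truly has no effect,
# so H0 holds and Lambda should behave like a chi-squared with 1 df.
rng = np.random.default_rng(7)
n = 2_000
x = rng.normal(size=n)
N = rng.poisson(np.exp(-1.0 + 0.0 * x))   # true slope is 0

def fit_loglik(X):
    """Poisson MLE via Fisher scoring; log-likelihood up to a constant."""
    beta = np.zeros(X.shape[1])
    for _ in range(30):
        mu = np.exp(X @ beta)
        beta = beta + np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (N - mu))
    eta = X @ beta
    return (N * eta - np.exp(eta)).sum()   # the log(N!) term cancels in the ratio

ll_full = fit_loglik(np.column_stack([np.ones(n), x]))   # full model
ll_null = fit_loglik(np.ones((n, 1)))                    # nested null model

lrt = -2.0 * (ll_null - ll_full)    # Lambda >= 0 by nesting
p_value = chi2.sf(lrt, df=1)        # reject H0 if this is too small
```

Since the null model is nested, `lrt` is non-negative up to numerical tolerance, and under $H_0$ the $p$-value is approximately uniform.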

**Wald Test** Alternatively, we can use the Wald statistic. The Wald statistic uses a second order approximation to the log-likelihood and, therefore, is only based on the first two moments (and not on the entire distribution). Define the matrix $I_r \in \mathbb{R}^{r \times (q+1)}$ such that $\boldsymbol{\beta}_r = I_r \boldsymbol{\beta}$, i.e., the matrix $I_r$ selects exactly the components of $\boldsymbol{\beta}$ that are included in $\boldsymbol{\beta}_r$ (and which are set to 0 under the null hypothesis $H_0$ given in (5.31)).

Asymptotic normality (5.17) motivates consideration of the Wald statistic

$$W = \left(I_r \widehat{\boldsymbol{\beta}}^{\mathrm{MLE}} - 0\right)^\top \left( I_r\, \mathcal{I}\big(\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}\big)^{-1} I_r^\top \right)^{-1} \left(I_r \widehat{\boldsymbol{\beta}}^{\mathrm{MLE}} - 0\right). \tag{5.32}$$

The Wald statistic measures the distance between the MLE in the full model, restricted to the components of $\boldsymbol{\beta}_r$ via $I_r \widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}$, and the null hypothesis $H_0$ (being $\boldsymbol{\beta}_r = 0$). The estimated Fisher's information matrix $\mathcal{I}(\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}})$ is used to bring all components onto the same unit scale (and to account for collinearity). The Wald statistic $W$ is asymptotically $\chi^2_r$-distributed under the same assumptions as for (5.17) to hold. Thus, the null hypothesis $H_0$ should be rejected if the resulting $p$-value of $W$ under the $\chi^2_r$-distribution is too small. Note that this test does not require calculation of the MLE in the null hypothesis model, i.e., this test is computationally more attractive than the LRT because we only need to fit one model. Again, an unknown dispersion parameter $\varphi$ in Fisher's information matrix $\mathcal{I}(\boldsymbol{\beta})$ is replaced by a consistent estimator $\widehat{\varphi}$ (from the bigger model).

In the special case of considering only one component of $\boldsymbol{\beta}$, i.e., if $\boldsymbol{\beta}_r = \beta_k$ with $r = 1$ and one selected component $0 \le k \le q$, the Wald statistic reduces to

$$W_k = \frac{\big(\widehat{\beta}_k^{\mathrm{MLE}}\big)^2}{\widehat{\sigma}_k^2} \qquad \text{or} \qquad T_k = W_k^{1/2} = \frac{\widehat{\beta}_k^{\mathrm{MLE}}}{\widehat{\sigma}_k}, \tag{5.33}$$

with diagonal entries of the inverse of the estimated Fisher's information matrix given by $\widehat{\sigma}_k^2 = \big(\mathcal{I}(\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}})^{-1}\big)_{k,k}$, $0 \le k \le q$. The square-roots of these estimates are provided in column Std. Error of the R output in Listing 5.3.

In this case the Wald statistic $W_k$ is equal to the square of the $t$-statistic $T_k$; this $t$-statistic is provided in column z value of the R output of Listing 5.3. Remark that Fisher's information matrix involves the dispersion parameter $\varphi$. If this dispersion parameter is estimated with a consistent estimator $\widehat{\varphi}$, we have a $t$-statistic. For known dispersion parameter the $t$-statistic reduces to a $z$-statistic, i.e., the corresponding $p$-values can be calculated from a normal distribution instead of a $t$-distribution. In the Poisson case, the dispersion $\varphi = 1$ is known, and for this reason we perform a $z$-test (and not a $t$-test) in the last column of Listing 5.3; we call $T_k$ a $z$-statistic in that case.
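The single-component Wald/z-test can be sketched as follows. This is a hedged simulation (invented coefficients, our own IRLS loop) for a Poisson GLM with log-link, where $\varphi = 1$ is known and Fisher's information is $\mathcal{I}(\boldsymbol{\beta}) = \mathfrak{X}^\top \mathrm{diag}(\mu)\, \mathfrak{X}$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical Wald/z-test sketch for one coefficient of a Poisson GLM.
rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
N = rng.poisson(np.exp(X @ np.array([-2.0, 0.3])))   # simulated counts

# Fisher scoring (IRLS) for the MLE
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    fisher = X.T @ (X * mu[:, None])                 # Fisher's information I(beta)
    beta = beta + np.linalg.solve(fisher, X.T @ (N - mu))

cov = np.linalg.inv(X.T @ (X * np.exp(X @ beta)[:, None]))
se = np.sqrt(np.diag(cov))       # the "Std. Error" column of a glm summary
z = beta / se                    # T_k in (5.33), a z-statistic since phi = 1
W = z ** 2                       # Wald statistic W_k
p_values = 2.0 * norm.sf(np.abs(z))   # two-sided normal p-values
```

With a true slope of 0.3 and this sample size, the slope's $p$-value is far below any usual significance level, so the component would be kept.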

## *5.3.3 Analysis of Variance*

In the previous section, we have presented tests that allow for model selection in the case of nested models. More generally, if we have a full model, say, based on regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$, we would like to select the "best" sub-model according to some selection criterion. In most cases, it is computationally not feasible to fit all sub-models if $q$ is large; therefore, this is not a practical solution. For large models and data sets, step-wise procedures are a feasible tool. *Backward elimination* starts from the full model, and then recursively drops feature components which have high $p$-values in the corresponding Wald statistics (5.32) and (5.33). Performing this recursively provides us with a hierarchy of nested models. *Forward selection* works in the opposite direction, that is, we start with the null model and include feature components one after the other that have a low $p$-value in the corresponding Wald statistic.

#### *Remarks 5.18*


Typically, in practice, a so-called analysis of variance (ANOVA) table is studied. The ANOVA table is mainly motivated by the Gaussian model with orthogonal data. The Gaussian assumption implies that the deviance loss is equal to the square loss and the orthogonality implies that the square loss decouples in an additive way w.r.t. the feature components. This implies that one can explicitly study the contribution of each feature component to the decrease in square loss; an example is given in Section 2.3.2 of McCullagh–Nelder [265]. In non-Gaussian and non-orthogonal situations one loses this additivity property and, as mentioned in Remarks 5.18, the order of inclusion matters. Therefore, for the ANOVA table we pre-specify the order in which the components are included and then we analyze the decrease of deviance loss by the inclusion of additional components.
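The order dependence in non-orthogonal situations can be made concrete with a small simulation. In the following hedged sketch (invented data, our own Fisher-scoring fitter; think of the two correlated features as stand-ins for Area and Density), the deviance decrease attributed to a feature depends strongly on whether it enters first or last:

```python
import numpy as np

# Hypothetical mini-ANOVA: with two highly correlated features, the deviance
# drop credited to each depends on the order of inclusion.
rng = np.random.default_rng(3)
n = 20_000
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # x2 almost collinear with x1
N = rng.poisson(np.exp(-2.0 + 0.4 * x1))     # only x1 truly matters

def deviance(X):
    """Fit a Poisson GLM with log-link and return its deviance loss."""
    beta = np.zeros(X.shape[1])
    for _ in range(30):
        mu = np.exp(X @ beta)
        beta = beta + np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (N - mu))
    mu = np.exp(X @ beta)
    with np.errstate(divide="ignore", invalid="ignore"):
        nlogn = np.where(N > 0, N * np.log(N / mu), 0.0)
    return 2.0 * (mu - N + nlogn).sum()

ones = np.ones(n)
d0 = deviance(np.column_stack([ones]))            # null model
dA1 = deviance(np.column_stack([ones, x1]))       # order A: x1 first
dA2 = deviance(np.column_stack([ones, x1, x2]))   # ... then x2
dB1 = deviance(np.column_stack([ones, x2]))       # order B: x2 first

drop_x1_first = d0 - dA1    # large: x1 explains a lot on its own
drop_x2_last = dA1 - dA2    # small: x2 adds little once x1 is in
drop_x2_first = d0 - dB1    # large again: x2 "takes over the role" of x1
```

This mirrors the Area/Density discussion below: the last-included correlated feature always looks weak in a sequential ANOVA table.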

*Example 5.19 (Poisson GLM1, Revisited)* We revisit the MTPL claim frequency example of Sect. 5.2.4 to illustrate the variable selection procedures. Based on the model presented in Listing 5.3 we run an ANOVA analysis using the R command anova; the results are presented in Listing 5.4.

Listing 5.4 shows the hierarchy of models starting from the null model, obtained by sequentially including feature components one by one. The column Df gives the number of regression parameters involved, and the column Deviance gives the decrease of deviance loss by the inclusion of this feature component. The biggest model improvements are provided by the bonus-malus level and the driver's age; this is not surprising in view of the empirical analysis in Figs. 5.3 and 5.4, and in Chap. 13.1. At the other end we have the Area code, which only seems to improve the model marginally. However, this does not yet imply that this variable should be dropped. There are two points that need to be considered: (1) maybe the feature pre-processing of Area has not been done in an appropriate way and the variable is not in the right functional form for the chosen link function; and (2) Area is the last variable included in the model in Listing 5.4 and, maybe, there are already other variables

**Listing 5.4** ANOVA table of model Poisson GLM1

```
1 Analysis of Deviance Table
2
3 Model: poisson, link: log
4
5 Response: ClaimNb
6
7 Terms added sequentially (first to last)
8
9
10 Df Deviance Resid. Df Resid. Dev
11 NULL 610205 153852
12 VehPowerGLM 5 73.7 610200 153779
13 VehAgeGLM 2 179.7 610198 153599
14 DrivAgeGLM 6 1199.4 610192 152400
15 BonusMalusGLM 1 4300.6 610191 148099
16 VehBrand 10 240.3 610181 147859
17 VehGas 1 82.4 610180 147776
18 DensityGLM 1 512.1 610179 147264
19 Region 21 191.3 610158 147073
20 AreaGLM 1 4.1 610157 147069
```
that take over the role of Area in smaller models, which is possible if we have correlations between the feature components. In our data, Area and Density are highly correlated. For this reason, we exchange the order of these two components and run the same analysis again; we call this model Poisson GLM1B (which, of course, provides the same predictive model as Poisson GLM1).

**Listing 5.5** ANOVA table of model Poisson GLM1B

```
1 Analysis of Deviance Table
2
3 Model: poisson, link: log
4
5 Response: ClaimNb
6
7 Terms added sequentially (first to last)
8
9
10 Df Deviance Resid. Df Resid. Dev
11 NULL 610205 153852
12 VehPowerGLM 5 73.7 610200 153779
13 VehAgeGLM 2 179.7 610198 153599
14 DrivAgeGLM 6 1199.4 610192 152400
15 BonusMalusGLM 1 4300.6 610191 148099
16 VehBrand 10 240.3 610181 147859
17 VehGas 1 82.4 610180 147776
18 AreaGLM 1 505.0 610179 147271
19 Region 21 192.4 610158 147079
20 DensityGLM 1 10.1 610157 147069
```
Listing 5.5 shows the ANOVA table if we exchange the order of these two variables. We observe that the magnitude of the decrease in deviance loss has switched between the two variables. Overall, Density seems slightly more predictive, and we may consider dropping Area from the model, also because the correlation between Density and Area is very high.

If we want to perform backward elimination (sequentially dropping one variable after the other) we can use the R command drop1. For small models this is doable; for larger models it is computationally demanding.

**Listing 5.6** drop1 analysis of model Poisson GLM1

```
1 Single term deletions
2
3 Model:
4 ClaimNb ~ VehPowerGLM + VehAgeGLM + DrivAgeGLM + BonusMalusGLM +
5 VehBrand + VehGas + DensityGLM + Region + AreaGLM
6 Df Deviance AIC LRT Pr(>Chi)
7 <none> 147069 192818
8  VehPowerGLM    5 147152 192892   83.4 < 2.2e-16 ***
9  VehAgeGLM      2 147283 193028  214.1 < 2.2e-16 ***
10 DrivAgeGLM     6 147603 193341  534.5 < 2.2e-16 ***
11 BonusMalusGLM  1 150970 196718 3901.5 < 2.2e-16 ***
12 VehBrand      10 147298 193027  228.9 < 2.2e-16 ***
13 VehGas         1 147213 192961  144.5 < 2.2e-16 ***
14 DensityGLM     1 147079 192826   10.1  0.001459 **
15 Region        21 147259 192967  190.7 < 2.2e-16 ***
16 AreaGLM        1 147073 192820    4.1  0.042180 *
17 ---
18 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
In Listing 5.6 we present the results of this drop1 analysis. Both according to AIC and according to the LRT, we should keep all variables in the model. Again, Area and Density provide the smallest LRT statistics, which illustrates the high collinearity between these two variables (note that the values in Listing 5.6 are identical to the ones in Listings 5.4 and 5.5, respectively).

We conclude that in model Poisson GLM1 we should keep all feature components, and a model improvement can only be obtained by a different feature pre-processing, a different regression function, or a different distributional model.

## *5.3.4 Lab: Poisson GLM for Car Insurance Frequencies, Revisited*

#### **Continuous Coding of Non-monotone Feature Components**

We revisit model Poisson GLM1 studied in Sect. 5.2.4 for MTPL claim frequency modeling, and we consider additional competing models by using different feature pre-processing. From Example 5.19, above, we conclude that we should keep all variables in the model if we work with model Poisson GLM1.


**Table 5.4** Contingency table of observed number of policies against predicted number of policies with given claim counts ClaimNb

We calculate Pearson's dispersion estimate which provides $\widehat{\varphi}^{\mathrm{P}} = 1.6697 > 1$. This indicates that the model is not fully suitable for our data because in a Poisson model the dispersion parameter should be equal to 1. There may be two reasons for this over-dispersion: (1) the Poisson assumption is not appropriate because, for instance, the tail of the observations is more heavy-tailed; or (2) the Poisson assumption is appropriate, but the regression function has not been chosen in a fully suitable way (maybe also due to missing feature information).

We believe that in our example the observed over-dispersion is a mixture of the two reasons (1) and (2). Surely, the regression structure can be improved since our feature pre-processing is non-optimal and since the chosen regression function only considers multiplicative interactions between the feature components (we have chosen the log-link regression function without adding interaction terms to the regression function).

Table 5.4 gives a contingency table. We observe that we have many more policies with more than 1 claim compared to what is predicted by the fitted model. As a result, a $\chi^2$-test rejects this Poisson model because the resulting $p$-value is close to 0.

In our data, we have a rather large number of policies with short exposures $v_i$, and further analysis suggests that these short exposures are not suitably modeled. We will not invest more time into improving the exposure modeling. As mentioned in the appendix, there seem to be a couple of issues concerning how the exposures are displayed and how policy renewals are accounted for in this data. However, it is difficult (almost impossible) to clean the data for better exposure measures without more detailed information about the data collection process.

Our next aim is to model continuous feature components differently, if their raw form does not match the linear predictor assumption. In Poisson GLM1 we have categorized such components and then used dummy coding for the resulting classes, see Sect. 5.2.4. Alternatively, we can use different functional forms, for instance, we can use for DrivAge the following pre-processing

$$\texttt{DrivAge} \mapsto \beta_l\, \texttt{DrivAge} + \beta_{l+1} \log(\texttt{DrivAge}) + \sum_{j=2}^{4} \beta_{l+j} (\texttt{DrivAge})^j. \tag{5.34}$$


**Table 5.5** Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses, tenfold cross-validation losses (units are in 10−2) and in-sample average frequency of the null model (intercept model) and of different Poisson GLMs

This replaces the $K = 7$ categorical age classes of model Poisson GLM1 by 5 continuous functions of the variable DrivAge, and the number of regression parameters is reduced from $K - 1 = 6$ to 5. We call this model Poisson GLM2.

Besides improving the modeling of the feature components, we can also start to add interactions beyond the multiplicative ones. For instance, Fig. 13.12 in Chap. 13 may indicate that there is an interaction term between BonusMalus and DrivAge. New young drivers enter the bonus-malus system at level 100, and it takes some years free of accidents to reach the lowest bonus-malus level of 50. For senior drivers, on the other hand, a bonus-malus level of 100 may indicate that they have had a bad claim experience, because otherwise they would be on the lowest bonus-malus level, see also Remark 5.15. We add the following interaction to Poisson GLM2, and we call the resulting model Poisson GLM3

$$\beta_{l'}\, \texttt{BonusMalus} \cdot \texttt{DrivAge} + \beta_{l'+1}\, \texttt{BonusMalus} \cdot (\texttt{DrivAge})^2. \tag{5.35}$$

From Table 5.5 we observe that this leads to a further small model improvement. We mention that this model improvement can also be observed in a decrease of Pearson's dispersion estimate to $\widehat{\varphi}^{\mathrm{P}} = 1.6644$. Notably, all model selection criteria (AIC, out-of-sample generalization loss and cross-validation) come to the same conclusion in this example.
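The codings (5.34) and (5.35) amount to building specific design-matrix columns. A minimal numpy sketch (ages and bonus-malus values invented for illustration; the book does this via an R formula, see the drop1 listing below):

```python
import numpy as np

# Hypothetical feature values for five policies
driv_age = np.array([19.0, 26.0, 41.0, 55.0, 73.0])
bonus_malus = np.array([100.0, 90.0, 50.0, 50.0, 100.0])

# continuous coding (5.34) of DrivAge: 5 columns replacing 6 dummy columns
X_age = np.column_stack([
    driv_age,            # beta_l * DrivAge
    np.log(driv_age),    # beta_{l+1} * log(DrivAge)
    driv_age ** 2,       # beta_{l+2} * DrivAge^2
    driv_age ** 3,       # beta_{l+3} * DrivAge^3
    driv_age ** 4,       # beta_{l+4} * DrivAge^4
])

# interaction columns (5.35) added in model Poisson GLM3
X_inter = np.column_stack([
    bonus_malus * driv_age,         # BonusMalus * DrivAge
    bonus_malus * driv_age ** 2,    # BonusMalus * DrivAge^2
])
```

In R the same columns arise from formula terms like `DrivAge + log(DrivAge) + I(DrivAge^2)`, as in the listings of this section.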

The tedious task of the modeler now is to find all these systematic effects and bring them into the model in an appropriate form. Here, this is still possible because we have a comparably small model. However, if we have hundreds of feature components, such a manual analysis becomes intractable. Other regression models such as network regression models should then be preferred, or at least should be used to find systematic effects. But one should also keep in mind that the (final) chosen model should be as simple as possible (parsimonious).

#### *Remarks 5.20*

• An advantage of GLMs is that these regression models can deal with collinearity in feature components. Nevertheless, the results should be carefully checked if the collinearity in feature components is very high. If we have a high collinearity between two feature components then we may observe large values with opposite signs in the corresponding regression parameters compensating each other. The

```
1 Single term deletions
2
3 Model:
4 ClaimNb ~ VehPowerGLM + VehAgeGLM + DrivAge + log(DrivAge) +
5 I(DrivAge^2) + I(DrivAge^3) + I(DrivAge^4) + BonusMalusGLM +
6 VehBrand + VehGas + DensityGLM + Region + AreaGLM
7 Df Deviance AIC LRT Pr(>Chi)
8 <none> 147005 192753
9  VehPowerGLM    5 147087 192825   82.4 2.671e-16 ***
10 VehAgeGLM      2 147225 192969  220.3 < 2.2e-16 ***
11 DrivAge        1 147157 192902  151.9 < 2.2e-16 ***
12 log(DrivAge)   1 147190 192935  184.8 < 2.2e-16 ***
13 I(DrivAge^2)   1 147123 192869  118.1 < 2.2e-16 ***
14 I(DrivAge^3)   1 147094 192840   89.0 < 2.2e-16 ***
15 I(DrivAge^4)   1 147071 192816   65.5 5.687e-16 ***
16 BonusMalusGLM  1 150907 196653 3902.0 < 2.2e-16 ***
17 VehBrand      10 147232 192959  226.5 < 2.2e-16 ***
18 VehGas         1 147148 192893  142.8 < 2.2e-16 ***
19 DensityGLM     1 147015 192761   10.1  0.001498 **
20 Region        21 147193 192899  188.0 < 2.2e-16 ***
21 AreaGLM        1 147009 192755    4.1  0.043123 *
22 ---
23 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
resulting GLM will not be very robust, and a slight change in the observations may change these regression parameters completely. In this case one should drop one of the two highly collinear feature components. This problem may also occur if we include too many terms in functional forms like in (5.34).

• A tool to find suitable functional forms of regression functions in continuous feature components is the partial residual plot of Cook–Croos-Dabrera [80]. If we want to analyze the first feature component $x_1$ of $\boldsymbol{x}$, we can fit a GLM to the data using the entire feature vector $\boldsymbol{x}$. The partial residuals for component $x_1$ are defined by, see formula (8) in Cook–Croos-Dabrera [80],

$$r_i^{\text{partial}} = \left(Y_i - \widehat{\mu}(\boldsymbol{x}_i)\right) g'\big(\widehat{\mu}(\boldsymbol{x}_i)\big) + \widehat{\beta}_1 x_{i,1} \qquad \text{for } 1 \le i \le n,$$

where $g$ is the chosen link function and $g(\widehat{\mu}(\boldsymbol{x}_i)) = \langle \widehat{\boldsymbol{\beta}}, \boldsymbol{x}_i \rangle$. These partial residuals offset the effect of feature component $x_1$. The partial residual plot shows $r_i^{\text{partial}}$ against $x_{i,1}$. If this plot shows a linear structure, then including $x_1$ linearly is justified, and any other functional form may be detected from that plot.
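A hedged numerical sketch of this diagnostic (simulated data, our own Fisher-scoring fitter): for the log-link, $g'(\mu) = 1/\mu$, and if the true effect of $x_1$ is quadratic while the fitted model is linear in $x_1$, the partial residuals plotted against $x_1$ reveal the curvature.

```python
import numpy as np

# Hypothetical partial-residual computation for a Poisson GLM with log-link.
rng = np.random.default_rng(5)
n = 10_000
x1 = rng.uniform(-1.0, 1.0, size=n)
x2 = rng.normal(size=n)
Y = rng.poisson(np.exp(-1.0 + 0.8 * x1 ** 2 + 0.2 * x2))  # x1 acts quadratically

# fit the (misspecified) model that is linear in x1, by Fisher scoring
X = np.column_stack([np.ones(n), x1, x2])
beta = np.zeros(3)
for _ in range(30):
    mu = np.exp(X @ beta)
    beta = beta + np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (Y - mu))

mu = np.exp(X @ beta)
# r_i^partial = (Y_i - mu_i) * g'(mu_i) + beta_1 * x_{i,1}, with g'(mu) = 1/mu
r_partial = (Y - mu) / mu + beta[1] * x1
# plotting r_partial against x1 (e.g. with a scatter smoother) would show
# the quadratic structure and suggest adding an x1^2 term
```

Here the partial residuals correlate positively with $x_1^2$, flagging the missing quadratic term.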

#### **Under-Sampling and Over-Sampling**

Often run times are an issue in model fitting, in particular if we want to experiment with different models, different feature codings, etc. Under-sampling is an interesting approach that can be applied in imbalanced situations (like in our claim frequency data situation) to speed up calculations while still obtaining accurate approximations. We briefly describe under-sampling in this subsection.

Under-sampling is based on the idea that we do not need to consider all $n = 610\,206$ insurance policies for model fitting, and we can still obtain accurate results. For this, we select all insurance policies that have at least 1 claim; in our data these are 22'434 insurance policies, and we call this data set $\mathcal{L}^*_{\ge 1}$. The motivation for selecting these insurance policies is that these are exactly the policies that carry information about the drivers causing claims. These selected insurance policies need to be complemented with policies that do not cause any claims. We select at random (under-sample) 22'434 insurance policies of drivers without claims, and we call this data set $\mathcal{L}^*_0$. Merging the two sets we receive the data $\mathcal{L}^* = \mathcal{L}^*_0 \cup \mathcal{L}^*_{\ge 1}$ comprising 44'868 insurance policies. This data is balanced from the viewpoint of claim causing policies because exactly one half of the policies in $\mathcal{L}^*$ suffers a claim and the other half does not. The idea now is to fit a GLM only on this learning data $\mathcal{L}^*$, and because we only consider 44'868 insurance policies the fitting should be fast.

There is still one point to be considered, namely, in the new learning data $\mathcal{L}^*$ policies with claims are over-represented (because we work in a low frequency problem). This motivates adjusting the time exposures $v_i$ in $\mathcal{L}^*_0$ accordingly by multiplying as follows

$$v_i \mapsto v_i^{*} = v_i\, \frac{\sum_{j=1}^{n} v_j \mathbb{1}_{\{N_j = 0\}}}{\sum_{j \in \mathcal{L}^*_0} v_j}.$$

Thus, we stretch the exposures of the policies without claims in $\mathcal{L}^*$; for our data this factor is 26.17. This then provides us with an empirical frequency on $\mathcal{L}^*$ of 7.36%, which is identical to the observed frequency on the entire learning data $\mathcal{L}$.
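The exposure stretching can be sketched as follows. This is a hypothetical simulation (portfolio size, frequency and exposures invented, numpy only); by construction, the empirical frequency on the reduced data exactly matches the full data:

```python
import numpy as np

# Minimal under-sampling sketch: keep all policies with claims, sample an
# equally large set of claim-free policies, and stretch their exposures.
rng = np.random.default_rng(11)
n = 100_000
v = rng.uniform(0.05, 1.0, size=n)        # time exposures v_i
N = rng.poisson(0.07 * v)                 # claim counts, roughly 7% frequency

keep_claims = N >= 1                      # L*_{>=1}: all policies with claims
idx_zero = np.flatnonzero(N == 0)
sampled = rng.choice(idx_zero, size=keep_claims.sum(), replace=False)  # L*_0

# stretch the exposures of the sampled claim-free policies as in the display
factor = v[N == 0].sum() / v[sampled].sum()
v_star = v.copy()
v_star[sampled] *= factor

freq_full = N.sum() / v.sum()
freq_reduced = N[keep_claims].sum() / (v[keep_claims].sum() + v_star[sampled].sum())
# freq_reduced equals freq_full by construction of the stretching factor
```

The denominator of the reduced data equals $\sum_j v_j$ exactly, because the stretched exposures of the sampled claim-free policies sum to the total claim-free exposure.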

We fit model Poisson GLM3 on this reduced (and exposure adjusted) learning data $\mathcal{L}^*$; the results are presented on the last line of Table 5.6. This model can be fitted in one second, and by construction it fulfills the balance property. The resulting in-sample and out-of-sample losses (evaluated on the entire data $\mathcal{L}$ and $\mathcal{T}$) are very close to model Poisson GLM3, which verifies that the model fitted only on the learning data $\mathcal{L}^*$ gives a good approximation. We do not provide AIC because the data used is not identical to the data used to fit the other models. The tenfold cross-validation loss is a little bit bigger, which seems to be a consequence of applying the non-stratified version to only 44'868 insurance policies, i.e., this higher cross-validation loss shows that we fit the model on less data, which results in higher uncertainty in model fitting. This finishes this example.

**Table 5.6** Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses, tenfold cross-validation losses (units are in 10−2) and in-sample average frequency of the null model (intercept model) and of different Poisson GLMs; the last row uses under-sampling in model Poisson GLM3

The presented method is called under-sampling because we under-sample from the insurance policies without claims to make both classes (policies with claims and policies without claims) equally large. Alternatively, to achieve a class balance we could also over-sample from the minority class by duplicating policies. This has a similar effect, but it increases run times. Importantly, if we under- or over-sample we *have* to adjust the exposures correspondingly. Otherwise we obtain a biased model that is not useful for pricing; the same applies to methods such as the synthetic minority oversampling technique (SMOTE) and similar techniques.

Alternatively to under-sampling, we could also fit a so-called zero-truncated Poisson (ZTP) model to the data by only fitting a model on the insurance policies that suffer at least one claim, adjusting the distribution to the observations $N_i \mid \{N_i \ge 1\}$. This is rather similar to a hurdle Poisson model, and we come back to this in Example 6.19, below.

## *5.3.5 Over-Dispersion in Claim Counts Modeling*

#### **Mixed Poisson Distribution**

In the previous example we have seen that the considered Poisson GLMs do not fully fit our data, at least not with the chosen feature engineering, because there is over-dispersion in the data (relative to the chosen models). This may give rise to considering models that allow for over-dispersion. Typically, such over-dispersed models are constructed starting from the Poisson model, because the Poisson model enjoys many nice properties, as we have seen above. A natural extension is the family of mixed Poisson models, where the frequency is not modeled with a single parameter but rather with a whole family of parameters described by an underlying mixing distribution.

In the dual mean parametrization the Poisson distribution for *Y* = *N/v* reads as

$$Y \sim f(y; \lambda, v) = e^{-v\lambda} \frac{(v\lambda)^{vy}}{(vy)!} \qquad \text{for } y \in \mathbb{N}_0/v,$$

where the mean parameter is given by $\lambda = \kappa'(\theta) = \exp\{\theta\}$, and $\theta$ denotes the canonical parameter; on purpose we use the notation $\lambda$ instead of $\mu$ for the mean here, the reason will become clear below. This model satisfies for the first two moments of $N = vY$

$$\mathbb{E}_{\lambda}\left[N\right] = v\kappa'(\theta) = v\lambda \qquad \text{and} \qquad \mathrm{Var}_{\lambda}\left(N\right) = v\kappa''(\theta) = v\lambda = \mathbb{E}_{\lambda}\left[N\right],$$

with dispersion parameter $\varphi = 1$. A mixed Poisson distribution is obtained by mixing (integrating) over different frequency parameters $\lambda > 0$. We choose a distribution $\pi$ on $\mathbb{R}_+$ (with strictly positive support), and define the new distribution

$$Y = N/v \sim f_{\pi}(y; v) = \int_{\mathbb{R}_+} f(y; \lambda, v)\, d\pi(\lambda) = \int_{\mathbb{R}_+} e^{-v\lambda} \frac{(v\lambda)^{vy}}{(vy)!}\, d\pi(\lambda). \tag{5.36}$$

If *π* is not concentrated in a single point, the tower property immediately implies

$$\mathbb{E}_{\pi}\left[N\right] < \mathrm{Var}_{\pi}\left(N\right), \tag{5.37}$$

provided that the moments exist; we refer to Lemma 2.18 in Wüthrich [387]. Hence, mixing over different frequency parameters allows us to obtain over-dispersion. Of course, this concept can also be applied to mixing over the canonical parameter $\theta$ in the EF (instead of the mean parameter).
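The over-dispersion produced by (5.36)–(5.37) is easy to verify by simulation. In this hedged sketch (parameters invented), we mix the Poisson frequency over a gamma distribution, anticipating the NB2 parametrization of the next subsection:

```python
import numpy as np

# Simulation sketch of a gamma-mixed Poisson: variance strictly exceeds
# the mean. Parameters mu and alpha are illustrative choices.
rng = np.random.default_rng(2)
m, v = 1_000_000, 1.0
mu, alpha = 0.1, 0.5

# pi = Gamma(shape = v*alpha, rate = v*alpha/mu), i.e. scale = mu/(v*alpha)
lam = rng.gamma(shape=v * alpha, scale=mu / (v * alpha), size=m)
N = rng.poisson(v * lam)     # mixed Poisson counts

mean_N = N.mean()            # approx v*mu = 0.1
var_N = N.var()              # approx v*mu*(1 + mu/alpha) = 0.12 > mean
```

The sample variance exceeds the sample mean by the factor $1+\mu/\alpha$ predicted by the closed-form negative-binomial moments (5.39)–(5.40) below.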

This leads to the framework of Bayesian credibility models which are widely used and studied in actuarial science, we refer to the textbook of Bühlmann–Gisler [58]. We have already met this idea in the Bayesian decision rule of Example 3.3 which has led to the Bayesian estimator in Definition 3.6.

#### **Negative-Binomial Model**

In the case of the Poisson model, the gamma distribution is a particularly attractive mixing distribution for $\lambda$ because it allows for a closed-form solution in (5.36), and $f_{\pi}(y; v)$ will be a negative-binomial distribution.<sup>4</sup> One can choose different parametrizations of this mixing distribution, and they will provide different scalings in the resulting negative-binomial distribution. We choose the parametrization $\pi(\lambda) \stackrel{(d)}{=} \Gamma(v\alpha, v\alpha/\mu)$ for mean parameter $\mu > 0$ and shape parameter $v\alpha > 0$. This implies, see (5.36),

$$\begin{split} f_{\mathrm{NB}}(y;\mu,v,\alpha) &= \int_{\mathbb{R}_{+}} e^{-v\lambda} \frac{(v\lambda)^{vy}}{(vy)!}\, \frac{(v\alpha/\mu)^{v\alpha}}{\Gamma(v\alpha)}\, \lambda^{v\alpha-1} e^{-v\alpha\lambda/\mu}\, d\lambda \\ &= \frac{\Gamma(vy+v\alpha)}{(vy)!\,\Gamma(v\alpha)}\, \frac{v^{vy}(v\alpha/\mu)^{v\alpha}}{(v+v\alpha/\mu)^{vy+v\alpha}} \\ &= \binom{vy+v\alpha-1}{vy} \left(e^{\theta}\right)^{vy} \left(1-e^{\theta}\right)^{v\alpha}, \end{split}$$

<sup>4</sup> The gamma distribution is the conjugate prior to the Poisson distribution. As a result, the posterior distribution, given observations, will again be a gamma distribution with posterior parameters, see Section 8.1 of Wüthrich [387]. This Bayesian model has been introduced to the actuarial literature by Bichsel [32].

setting for the canonical parameter $\theta = \log(\mu/(\mu+\alpha)) < 0$. This is the negative-binomial distribution we have already met in (2.5). A single-parameter linear EDF representation is given by (we set unit dispersion parameter $\varphi = 1$)

$$Y \sim f_{\mathrm{NB}}(y; \theta, v, \alpha) = \exp\left\{ \frac{y\theta + \alpha \log(1 - e^{\theta})}{1/v} + \log \binom{vy + v\alpha - 1}{vy} \right\}, \tag{5.38}$$

where this is a density w.r.t. the counting measure on $\mathbb{N}_0/v$. The cumulant function and the canonical link, respectively, are given by

$$\kappa(\theta) = -\alpha \log(1 - e^{\theta}) \qquad \text{and} \qquad \theta = h(\mu) = \log\left(\frac{\mu}{\mu + \alpha}\right) \in \boldsymbol{\Theta} = (-\infty, 0).$$

Note that $\alpha > 0$ is treated as a nuisance parameter (which is a fixed part of the cumulant function, here). The first two moments of the claim count $N = vY$ are given by

$$v\mu = \mathbb{E}\_{\theta}[N] = v\alpha \frac{e^{\theta}}{1 - e^{\theta}},\tag{5.39}$$

$$\mathrm{Var}\_{\theta}(N) = \mathbb{E}\_{\theta}[N] \left( 1 + \frac{e^{\theta}}{1 - e^{\theta}} \right) = \mathbb{E}\_{\theta}[N] \left( 1 + \frac{\mu}{\alpha} \right) \; > \; \mathbb{E}\_{\theta}[N]. \tag{5.40}$$

This shows that we receive a fixed over-dispersion of size *μ/α*, which (in this parametrization) does not depend on the exposure *v*; this is the reason for choosing the mixing distribution $\pi(\lambda) \stackrel{(d)}{=} \Gamma(v\alpha, v\alpha/\mu)$. This parametrization is called the NB2 parametrization.
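The NB2 moment and mixing identities can be verified numerically. The following Python sketch (with purely illustrative values for *v*, *μ* and *α*, not taken from the text) checks that *N* = *vY* has mean *vμ* and variance *vμ(*1 + *μ/α)*, and that numerically mixing Poisson*(vλ)* over the $\Gamma(v\alpha, v\alpha/\mu)$ mixing distribution reproduces the negative-binomial probability weights.

```python
import numpy as np
from scipy import stats

# Illustrative values (not from the text): exposure v, mean mu, nuisance alpha.
v, mu, alpha = 2.0, 0.3, 1.5

# N = vY is negative-binomial with size v*alpha and success probability
# 1 - e^theta = alpha/(mu + alpha), where e^theta = mu/(mu + alpha).
N = stats.nbinom(v * alpha, alpha / (mu + alpha))

# First two moments reproduce (5.39)-(5.40): fixed over-dispersion mu/alpha.
assert abs(N.mean() - v * mu) < 1e-12
assert abs(N.var() - v * mu * (1 + mu / alpha)) < 1e-12

# Mix Poisson(v*lambda) over the Gamma(v*alpha, v*alpha/mu) mixing
# distribution by midpoint-rule integration on a fine grid.
lam_dist = stats.gamma(a=v * alpha, scale=mu / (v * alpha))
h = 20.0 / 200_000
lams = (np.arange(200_000) + 0.5) * h
for n in range(4):
    mixed = np.sum(stats.poisson.pmf(n, v * lams) * lam_dist.pdf(lams)) * h
    assert abs(mixed - N.pmf(n)) < 1e-4
```

The numerical integration mirrors the first line of the mixed Poisson representation above; the closed-form negative-binomial weights mirror its last line.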

#### *Remarks 5.21*


• The unit deviance in this negative-binomial model is given by

$$(y,\mu) \mapsto \mathfrak{d}(y,\mu) = 2\left[y\log\left(\frac{y}{\mu}\right) - (y+\alpha)\log\left(\frac{y+\alpha}{\mu+\alpha}\right)\right],$$

we also refer to Table 4.1 for *α* = 1. We emphasize that this is the unit deviance in a single-parameter linear EDF, and we only aim at estimating the canonical parameter *θ* ∈ Θ and the mean parameter *μ* ∈ *M*, respectively, whereas *α >* 0 is treated as a given nuisance parameter. This is important because the unit deviance relies on the saturated model which, in general, estimates a one-dimensional parameter *θ* and *μ*, respectively, from the one-dimensional observation *Y*. The nuisance parameter is not affected by the consideration of the saturated model, and it is treated as a fixed part of the cumulant function, which is not estimated at this stage. An important consequence is that model comparison using deviance residuals only works for identical nuisance parameters.

• We mention that we receive over-dispersion in (5.40) though we have dispersion parameter *ϕ* = 1 in (5.38). Alternatively, we could apply the duality transformation $y \mapsto \widetilde{y} = y/\alpha$ for nuisance parameter *α >* 0; this gives the reproductive form of the negative-binomial model NB2, see also Remarks 2.13. This provides us with a density on $\mathbb{N}\_0/(v\alpha)$; set $\widetilde{\varphi} = 1/\alpha$,

$$\widetilde{Y} \sim f\_{\mathrm{NB}}(\widetilde{y}; \theta, v/\widetilde{\varphi}) = \exp\left\{ \frac{\widetilde{y}\theta + \log(1 - e^{\theta})}{1/(v\alpha)} + \log\binom{v\alpha\widetilde{y} + v\alpha - 1}{v\alpha\widetilde{y}} \right\}.$$

The cumulant function and the canonical link, respectively, are now given by

$$\kappa(\theta) = -\log(1 - e^{\theta}) \quad \text{and} \quad \theta = h(\widetilde{\mu}) = \log\left(\frac{\widetilde{\mu}}{\widetilde{\mu} + 1}\right) \in \Theta = (-\infty, 0).$$

The first two moments are, for *θ* ∈ Θ, given by

$$\widetilde{\mu} = \mathbb{E}\_{\theta}[\widetilde{Y}] = \frac{e^{\theta}}{1 - e^{\theta}},$$

$$\mathrm{Var}\_{\theta}(\widetilde{Y}) = \frac{\widetilde{\varphi}}{v}\, \kappa''(\theta) = \frac{1}{v\alpha}\, \widetilde{\mu}\,(1 + \widetilde{\mu}).$$

Thus, we receive the reproductive EDF representation with dispersion parameter $\widetilde{\varphi} = 1/\alpha$ and variance function $V(\widetilde{\mu}) = \widetilde{\mu}(1 + \widetilde{\mu})$. Moreover, $N = vY = v\alpha\widetilde{Y}$.

• The negative-binomial model with the NB1 parametrization uses the mixing distribution $\pi(\lambda) \stackrel{(d)}{=} \Gamma(\mu v/\alpha, v/\alpha)$. This leads to mean $\mathbb{E}\_{\theta}[N] = v\mu$ and variance $\mathrm{Var}\_{\theta}(N) = \mathbb{E}\_{\theta}[N](1 + \alpha)$. In this parametrization, *μ* enters the gamma function as $\Gamma(\mu v/\alpha)$ in the gamma density, which does not allow for an EDF representation. This parametrization has been called NB1 by Cameron–Trivedi [63] because both terms in the variance $\mathrm{Var}\_{\theta}(N) = v\mu + v\mu\alpha$ are linear in *μ*. In contrast, in the NB2 parametrization the second term has a square $v\mu^2/\alpha$ in *μ*, see (5.40). Further discussion is provided in Greene [171].
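The unit deviance of the first remark can be coded directly. A short Python sketch (illustrative values only) checks that it vanishes on the diagonal *y* = *μ* and is strictly positive otherwise, including the boundary case *y* = 0.

```python
import math

def nb_unit_deviance(y, mu, alpha=1.0):
    """NB2 unit deviance for a fixed nuisance parameter alpha (a sketch)."""
    # y*log(y/mu) -> 0 as y -> 0, so the y = 0 case is handled separately.
    term1 = y * math.log(y / mu) if y > 0 else 0.0
    term2 = (y + alpha) * math.log((y + alpha) / (mu + alpha))
    return 2.0 * (term1 - term2)

# The unit deviance vanishes on the diagonal and is positive otherwise.
assert abs(nb_unit_deviance(2.0, 2.0, alpha=1.5)) < 1e-12
assert nb_unit_deviance(2.0, 1.0, alpha=1.5) > 0.0
assert nb_unit_deviance(0.0, 0.5, alpha=1.5) > 0.0
```

Note that comparing such deviances across models only makes sense for one and the same *α*, as emphasized above.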

#### Nuisance Parameter Estimation

All previous statements have been based on the assumption that *α >* 0 is a *given* nuisance parameter. If *α* needs to be estimated too, then we drop out of the EF. In this case, an iterative estimation procedure is applied to the EDF representation (5.38). One starts with a fixed nuisance parameter $\alpha^{(0)}$ and fits the negative-binomial GLM with MLE, which provides a first MLE $\widehat{\beta}^{(1)} = \widehat{\beta}^{(1)}(\alpha^{(0)})$. Based on this estimate, the nuisance parameter is updated $\alpha^{(0)} \to \widehat{\alpha}^{(1)}$ by maximizing the log-likelihood in *α* for given $\widehat{\beta}^{(1)}$. Iterating this procedure leads to a joint estimation of the regression parameter *β* and the nuisance parameter *α*. Both MLE steps in this algorithm increase the joint log-likelihood.
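For an intercept-only model this alternation can be sketched in a few lines of Python (simulated data with unit exposures; all names and values are our own, not from the text): for fixed *α* the intercept-only MLE of *μ* is the sample mean, and the *α*-step is a one-dimensional likelihood maximization.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)

# Simulated NB2 claim counts with unit exposures (illustrative values).
mu_true, alpha_true = 0.5, 2.0
counts = rng.negative_binomial(alpha_true,
                               alpha_true / (mu_true + alpha_true), 20_000)

def loglik(mu, alpha):
    # Joint log-likelihood of the NB2 model with success prob alpha/(mu+alpha).
    return stats.nbinom.logpmf(counts, alpha, alpha / (mu + alpha)).sum()

# Alternate the two MLE steps: for fixed alpha, the intercept-only MLE of
# mu is the sample mean; for fixed mu, maximize the log-likelihood in alpha.
mu_hat, alpha_hat = counts.mean(), 1.0
ll_start = loglik(mu_hat, alpha_hat)
for _ in range(5):
    alpha_hat = optimize.minimize_scalar(lambda a: -loglik(mu_hat, a),
                                         bounds=(1e-3, 50.0),
                                         method="bounded").x

# Each step increases the joint log-likelihood; estimates are plausible.
assert loglik(mu_hat, alpha_hat) >= ll_start
assert abs(mu_hat - mu_true) < 0.05
assert 0.5 < alpha_hat < 10.0
```

In the regression case, the *μ*-step is replaced by a full GLM fit for *β*, exactly as described above.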

*Remark 5.22 (Implementation of the Negative-Binomial GLM in* R*)* Implementation of the negative-binomial model needs some care. There are two R procedures, glm and glm.nb, that can be used to fit negative-binomial GLMs, the latter being built on the former. The procedure glm is the classical R procedure [307] that is usually used to fit GLMs within the EDF; it requires setting

```
family = negative.binomial(theta, link = "log")
```

This parametrization considers the single-parameter linear EF on $\mathbb{N}\_0$ (for mean *μ* ∈ *M*)

$$f\_{\mathrm{NB}}(n; \mu, \mathtt{theta}) = \binom{n + \mathtt{theta} - 1}{n} \left(\frac{\mu}{\mu + \mathtt{theta}}\right)^{n} \left(1 - \frac{\mu}{\mu + \mathtt{theta}}\right)^{\mathtt{theta}},$$

where theta *>* 0 denotes the nuisance parameter. The tricky part now is that we have to bring in the different exposures *vi* of all policies 1 ≤ *i* ≤ *n*. That is, we would like to have for claim counts *ni* = *viyi*, see (5.38),

$$\begin{split} f\_{\mathrm{NB}}(y\_{i};\mu\_{i},v\_{i},\alpha) &= \binom{v\_{i}y\_{i}+v\_{i}\alpha-1}{v\_{i}y\_{i}} \left(\frac{v\_{i}\mu\_{i}}{v\_{i}\mu\_{i}+v\_{i}\alpha}\right)^{v\_{i}y\_{i}} \left(1-\frac{v\_{i}\mu\_{i}}{v\_{i}\mu\_{i}+v\_{i}\alpha}\right)^{v\_{i}\alpha} \\ &= \binom{v\_{i}y\_{i}+v\_{i}\alpha-1}{v\_{i}y\_{i}} \left[\left(\frac{\mu\_{i}}{\mu\_{i}+\alpha}\right)^{y\_{i}} \left(1-\frac{\mu\_{i}}{\mu\_{i}+\alpha}\right)^{\alpha}\right]^{v\_{i}}.\end{split}$$

The square bracket can be implemented in glm as a scaled and weighted regression problem, see Listing 5.8 with theta = *α*. This approach provides the correct GLM parameter estimates $\widehat{\beta}^{\rm MLE}$ for given *α*; however, the reported AIC values cannot be compared to the Poisson case. Note that the Poisson case of Table 5.5 considers observations *Ni* whereas Listing 5.8 uses *Yi* = *Ni/vi*. For this reason we calculate the log-likelihood and AIC with our own implementation.

The same remark applies to glm.nb; moreover, nuisance parameter estimation cannot be performed by that routine under different exposures *vi*. Therefore, we have implemented the iterative estimation algorithm ourselves, alternating glm of Listing 5.8 for given *α* and the maximization routine optimize to find the optimal *α* for given *β* using (5.38). We have applied this iteration in Example 5.23, below, and it converged in 5 iterations.

*Example 5.23 (Negative-Binomial Distribution for Claim Counts)* We revisit the MTPL claim frequency GLM example of Sect. 5.3.4, but we replace the Poisson distribution by the negative-binomial one. We start with the negative-binomial (NB)


**Listing 5.8** Implementation of model NB GLM3

**Table 5.7** Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses (units are in 10−2) and in-sample average frequency of the null models (Poisson and negative-binomial) and the Poisson and negative-binomial GLMs. The optimal model is highlighted in boldface


null model. The NB null model has two parameters, the homogeneous (overall) frequency and the nuisance parameter. The MLE of the homogeneous overall frequency is identical to the one in the Poisson null model, and the MLE of the nuisance parameter provides $\widehat{\alpha}^{\rm MLE}\_{\rm null} = 1.059$. This is substantially smaller than infinity and suggests over-dispersion. The results are presented on the third line of Table 5.7. We observe a smaller AIC of the NB null model against the Poisson null model, which says that we should allow for over-dispersion.

We now focus on the NB GLM. The feature pre-processing is done exactly as in model Poisson GLM3, and we choose the log-link for *g*. We call this model NB GLM3. The iterative estimation procedure outlined above provides a nuisance parameter estimate $\widehat{\alpha}^{\rm MLE}\_{\rm NB} = 1.810$. This is bigger than in the NB null model because the regression structure explains some part of the over-dispersion; however, it is still substantially smaller than infinity, which justifies the inclusion of this over-dispersion parameter.

The last line of Table 5.7 gives the result of model NB GLM3. From AIC we conclude that we favor the negative-binomial GLM over the Poisson GLM since AIC decreases from 192'716 to 192'113. The in-sample and out-of-sample deviance losses can only be compared within the same models, i.e., the models that have the same cumulant function. This also applies to the negative-binomial models which have cumulant function $\kappa(\theta) = -\alpha\log(1-e^{\theta})$. Thus, to compare the NB null model and model NB GLM3, we need to choose the same nuisance parameter *α*. For this reason we added a second NB null model to Table 5.7. This second NB null model no longer uses the MLE $\widehat{\alpha}^{\rm MLE}\_{\rm null}$; therefore, the corresponding AIC only includes one estimated parameter.

**Table 5.8** Out-of-sample deviance losses: forecast dominance. The optimal model is highlighted in boldface


As mentioned above, deviance losses can only be compared under exactly the same cumulant function (including the same nuisance parameters). If we want a more robust model selection, we can consider forecast dominance according to Definition 4.20. Being less ambitious, here, we consider forecast dominance only for the three considered cumulant functions: Poisson, negative-binomial with $\widehat{\alpha}^{\rm MLE}\_{\rm null} = 1.059$ and negative-binomial with $\widehat{\alpha}^{\rm MLE}\_{\rm NB} = 1.810$. The out-of-sample deviance losses are given in the different columns of Table 5.8. According to this forecast dominance analysis we also give preference to model NB GLM3, but model Poisson GLM3 is pretty close.

Figure 5.7 compares the logged predictors $\log(\widehat{\mu}\_i)$, 1 ≤ *i* ≤ *n*, of the models Poisson GLM3 and NB GLM3. These predictors are very similar; only high frequency policies are judged slightly differently by the NB model compared to the Poisson model.

Table 5.9 gives the predicted number of claims against the observed ones. We observe that model NB GLM3 predicts more accurately the number of policies with 2 or less claims, but it over-estimates the number of policies with more than 2 claims. This may also be related to the fact that the estimated in-sample frequency has a


**Table 5.9** Contingency table of observed number of policies against predicted number of policies with given claim counts ClaimNb

positive bias in model NB GLM3, see Table 5.7. That is, since we do not work with the canonical link, we do not have the balance property.

**Listing 5.9** drop1 analysis of model NB GLM3


We close this example by providing the drop1 analysis in Listing 5.9. From this analysis we conclude that the feature component Area should be dropped. Of course, this confirms the high collinearity between Density and Area, which implies that we do not need both variables in the model. We remark that the AIC values in Listing 5.9 are not on our scale, as stated in Remark 5.22.

## *5.3.6 Zero-Inflated Poisson Model*

In many applications it is the case that the Poisson distribution does not fully fit the claim counts data because there are too many policies with zero claims, i.e., policies with *Y* = 0, compared to a Poisson assumption. This topic has attracted some attention in the recent actuarial literature, see, e.g., Boucher et al. [43–45], Frees et al. [137], Calderín-Ojeda et al. [62] and Lee [239]. An obvious solution to this problem is to 'artificially' increase the probability of a zero claim compared to a Poisson model; this is the proposal introduced by Lambert [232]. *Y* has a zero-inflated Poisson (ZIP) distribution if the probability weights of *Y* are given by (set *v* = 1)

$$f\_{\rm ZIP}(\mathbf{y};\theta,\pi\_0) = \begin{cases} \pi\_0 + (1 - \pi\_0)e^{-\mu} & \text{for } \mathbf{y} = \mathbf{0}, \\ (1 - \pi\_0)e^{-\mu}\frac{\mu^{\mathbf{y}}}{\mathbf{y}!} & \text{for } \mathbf{y} \in \mathbb{N}, \end{cases}$$

for *<sup>π</sup>*<sup>0</sup> <sup>∈</sup> *(*0*,* <sup>1</sup>*)*, *<sup>μ</sup>* <sup>=</sup> *<sup>e</sup><sup>θ</sup> <sup>&</sup>gt;* 0, and for the Poisson probability weights we refer to (2.4). For *π*<sup>0</sup> *>* 0 the weight of a zero claim *Y* = 0 is increased (inflated) compared to the original Poisson distribution.

#### *Remarks 5.24*

• The ZIP distribution has different interpretations. It can be interpreted as a hierarchical model where we have a latent variable *Z* which indicates with probability *π*<sup>0</sup> that we have an excess zero, and with probability 1 − *π*<sup>0</sup> we have an ordinary Poisson distribution, i.e. for *<sup>y</sup>* <sup>∈</sup> <sup>N</sup><sup>0</sup>

$$\mathbb{P}\_{\theta} \left\{ Y = y \,|\, Z = z \right\} = \begin{cases} \mathbb{1}\_{\{y = 0\}} & \text{for } z = 0, \\ e^{-\mu}\frac{\mu^{y}}{y!} & \text{for } z = 1, \end{cases} \tag{5.41}$$

with <sup>P</sup>[*<sup>Z</sup>* <sup>=</sup> <sup>0</sup>] = <sup>1</sup> <sup>−</sup> <sup>P</sup>[*<sup>Z</sup>* <sup>=</sup> <sup>1</sup>] = *<sup>π</sup>*0.

The latter shows that we can also understand it as a mixture of two distributions, namely, of the Poisson distribution and of a single point measure in *y* = 0 with mixing probability *π*0. Mixture distributions are going to be discussed in Sect. 6.3.1, below. In this sense, we can also interpret the model as a mixed Poisson model with mixing distribution *π(λ)* being a Bernoulli distribution taking values 0 and *μ* with probability *π*<sup>0</sup> and 1 − *π*0, respectively, see (5.36), and the former parameter *λ* = 0 leads to a degenerate Poisson model.


• A related proposal combines a point mass in zero with a lower-truncated Poisson distribution,

$$f\_{\text{hurdle Poisson}}(y;\theta,\pi\_0) = \begin{cases} \pi\_0 & \text{for } y = 0, \\ (1 - \pi\_0) \frac{e^{-\mu}\frac{\mu^{y}}{y!}}{1 - e^{-\mu}} & \text{for } y \in \mathbb{N}, \end{cases} \tag{5.42}$$

for *<sup>π</sup>*<sup>0</sup> <sup>∈</sup> *(*0*,* <sup>1</sup>*)* and *μ >* 0. For *<sup>π</sup>*<sup>0</sup> *> e*−*<sup>μ</sup>* the weight of a zero claim is increased and for *π*<sup>0</sup> *< e*−*<sup>μ</sup>* it is decreased. This distribution is called a hurdle distribution, because we first need to overcome the hurdle at zero to come to the Poisson model. Lower-truncated distributions are studied in Sect. 6.4, below, and mixture distributions are discussed in Sect. 6.3.1. In general, fitting lower-truncated distributions is challenging because the density and the distribution function should both have tractable forms to perform MLE for truncated distributions. The Expectation-Maximization (EM) algorithm is a useful tool to perform model fitting under truncation. We come back to the hurdle Poisson model in Example 6.19, below, and it is also closely related to the zero-truncated Poisson (ZTP) model discussed in Remarks 6.20.

The first two moments of a ZIP random variable *Y* ∼ *f*ZIP*(*·; *θ,π*0*)* are given by

$$\begin{aligned} \mathbb{E}\_{\theta,\pi\_0}[Y] &= (1 - \pi\_0)\mu, \\ \mathrm{Var}\_{\theta,\pi\_0}(Y) &= (1 - \pi\_0)\mu + (\pi\_0 - \pi\_0^2)\mu^2 = \mathbb{E}\_{\theta,\pi\_0}[Y] \left(1 + \pi\_0\mu\right), \end{aligned}$$

these calculations easily follow with the latent variable *Z* interpretation from above. As a consequence, we receive an over-dispersed model with over-dispersion *π*0*μ* (the latter also follows from the fact that we consider a mixed Poisson distribution with a Bernoulli mixing distribution having weights *π*<sup>0</sup> in 0 and 1 − *π*<sup>0</sup> in *μ >* 0, see (5.37)).
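The two moment formulas can also be checked by direct summation of the ZIP probability weights; here a small Python sketch with illustrative parameter values (our own choices, not from the text).

```python
import numpy as np
from scipy import stats

# Illustrative ZIP parameters (not from the text).
pi0, mu = 0.2, 1.7

# Build the ZIP probability weights on a truncated support.
ks = np.arange(200)
pmf = (1 - pi0) * stats.poisson.pmf(ks, mu)
pmf[0] += pi0  # inflate the zero-claim probability

mean = (ks * pmf).sum()
var = ((ks - mean) ** 2 * pmf).sum()

# Weights sum to 1; mean and variance match the displayed formulas.
assert abs(pmf.sum() - 1.0) < 1e-12
assert abs(mean - (1 - pi0) * mu) < 1e-10
assert abs(var - mean * (1 + pi0 * mu)) < 1e-10
```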

Unfortunately, MLE does not allow for explicit solutions in this model. The score equations of $Y\_i \stackrel{\text{i.i.d.}}{\sim} f\_{\mathrm{ZIP}}(\cdot;\theta,\pi\_0)$ are given by

$$\begin{aligned} \nabla\_{(\pi\_{0},\mu)} \ell\_{\boldsymbol{Y}}(\pi\_{0},\mu) &= \nabla\_{(\pi\_{0},\mu)} \sum\_{i=1}^{n} \log\left(\pi\_{0} + (1-\pi\_{0})e^{-\mu}\right) \mathbb{1}\_{\{Y\_{i}=0\}} \\ &\quad+ \nabla\_{(\pi\_{0},\mu)} \sum\_{i=1}^{n} \log\left((1-\pi\_{0})e^{-\mu}\frac{\mu^{Y\_{i}}}{Y\_{i}!}\right) \mathbb{1}\_{\{Y\_{i}>0\}} = 0. \end{aligned}$$

The R package pscl [401] has a function called zeroinfl which uses the general purpose optimizer optim to find the MLEs in the ZIP model. Alternatively, we could explore the EM algorithm for mixture distributions presented in Sect. 6.3, below.
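In the same spirit as zeroinfl's use of optim, one can maximize the ZIP log-likelihood with a general-purpose optimizer. The following Python sketch (simulated data, our own parameter values) uses an unconstrained parametrization to keep *π*0 ∈ *(*0*,* 1*)* and *μ >* 0.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(7)

# Hypothetical sample from a ZIP model with unit exposures.
pi0_true, mu_true = 0.3, 1.2
z = rng.random(20_000) < pi0_true                    # excess-zero indicator
y = np.where(z, 0, rng.poisson(mu_true, 20_000))

def neg_loglik(params):
    # Unconstrained parametrization: pi0 = sigmoid(a), mu = exp(b).
    a, b = params
    pi0, mu = 1 / (1 + np.exp(-a)), np.exp(b)
    p0 = pi0 + (1 - pi0) * np.exp(-mu)               # zero-claim probability
    ll = np.where(y == 0, np.log(p0),
                  np.log1p(-pi0) + stats.poisson.logpmf(y, mu))
    return -ll.sum()

res = optimize.minimize(neg_loglik, x0=[0.0, 0.0], method="Nelder-Mead")
pi0_hat = 1 / (1 + np.exp(-res.x[0]))
mu_hat = np.exp(res.x[1])

assert abs(pi0_hat - pi0_true) < 0.05
assert abs(mu_hat - mu_true) < 0.05
```

The latent variable *Z* is only used for simulation; the likelihood itself is the marginal ZIP likelihood of the score equations above.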

In insurance applications, the ZIP approach can be problematic if we have different exposures *vi >* 0 for different insurance policies *i*. In the Poisson GLM case with canonical link choice we typically integrate the different exposures into the offset, see (5.27). However, it is not clear whether and how we should integrate the different exposures into the zero-inflation probability *π*0. It seems natural to believe that shorter exposures should increase *π*0, but the explicit functional form of this increase can be debated; some options are discussed in Section 5 of Lee [239].



**Table 5.10** Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses (units are in 10−2) and in-sample average frequency of the null models (Poisson, negative-binomial and ZIP) and the Poisson, negative-binomial and ZIP GLMs. The optimal model is highlighted in boldface


In the following application, we simply choose *π*<sup>0</sup> independent of the exposures, but certainly this is not the best modeling choice.

*Example 5.25 (ZIP Model for Claim Counts)* We revisit the MTPL claim frequency example of Sect. 5.3.4, but this time we fit a ZIP model. For the Poisson part we use exactly the same GLM regression function as in model Poisson GLM3 and, in particular, we use for the different exposures *vi* of the insurance policies the offset term *oi* = log *vi*, see line 6 of Listing 5.10. This offset only acts on the Poisson part of the ZIP GLM. The zero-inflating probability *π*<sup>0</sup> is modeled with a logistic Bernoulli model, see Sect. 2.1.2. For computational reasons, we choose the null model for the Bernoulli part modeling the zero-inflation *π*0. This is indicated by the "1" on line 5 of Listing 5.10. This 1 should be expanded if we also want to consider a regression model for the zero-inflating probability *π*<sup>0</sup> and, in particular, if we want to integrate an offset term for the exposure. We can set this term to offset(f), where f is a suitable transformation of the exposure. Furthermore, successful calibration requires meaningful starting values, otherwise zeroinfl will not find the MLEs. We start the algorithm in the parameters of model Poisson GLM3, see line 7 of Listing 5.10. The results are presented in Table 5.10.

Firstly, we see that the run times are not fully competitive in this implementation, even if we choose the null model for the zero-inflating probability *π*0, i.e., only


**Table 5.11** Out-of-sample deviance losses: forecast dominance. The optimal model is highlighted in boldface

**Table 5.12** Contingency table of observed numbers of policies against predicted numbers of policies with given claim counts ClaimNb


one intercept parameter is involved for determining *π*0. Secondly, in this model we cannot calculate deviance losses because the saturated model has two parameters for each observation. Thirdly, the model does not satisfy the balance property even though we work with the canonical links for the Poisson part and the Bernoulli part; this property gets lost under the combination of these two model parts.

Most interesting are the AIC values. We observe that the ZIP GLM improves the Poisson GLM, but it has a bigger AIC value than the negative-binomial GLM. From this we conclude that we give preference to the negative-binomial model in our case.

Considering forecast dominance according to Definition 4.20, but restricted to the three deviance losses studied in Example 5.23, we receive Table 5.11. This table also gives preference to the negative-binomial GLM. However, if we consider the table of the observed numbers of policies against the predicted numbers of claims, see Table 5.12, we give preference to the ZIP GLM because it has the lowest *χ*2 value, i.e., it best reflects our observations (in-sample).

Figure 5.8 compares the resulting predictors on the log-scale. From this plot we conclude that in our example the predictors of the ZIP GLM are closer to the Poisson ones than the NB GLM predictors. In a next step, one could refine the modeling of the zero-inflating probability *π*0 by integrating the exposure and further feature information. This would lead to a further model improvement. We refrain from doing so here and close this example; in Example 6.19, below, we study the hurdle Poisson model.

**Fig. 5.8** Comparison of the linear predictors of the NB and ZIP GLMs against the ones of the Poisson GLM

## *5.3.7 Lab: Gamma GLM for Claim Sizes*

As a second example we consider claim size modeling within GLMs. For this example we do not use the French MTPL claims data because the empirical density plot in Fig. 13.15 indicates that a GLM will not fit that data. The French MTPL data seems to have three distinct modes, which suggests using a mixture distribution. Moreover, the log-log plot indicates a regularly varying tail, which cannot be captured by the EDF on the original observation scale; we are going to study this data in Example 6.14, below. Here, we use the Swedish motorcycle data, previously used in the textbook of Ohlsson–Johansson [290] and described in Chap. 13.2. From Fig. 5.9 we see that the empirical density has one mode, and the log-log plot supports light tails, i.e., the gamma model might be a suitable choice for this data. Therefore, we choose a gamma GLM with log-link *g*. As described above, the log-link is not the canonical link for the gamma EDF distribution, but it ensures the right sign w.r.t. the linear predictor $\eta\_i = \langle\beta, x\_i\rangle$. Working with the log-link in the gamma model implies that the balance property is not fulfilled.

**Fig. 5.9** (lhs) Empirical density, (middle) empirical distribution and (rhs) log-log plot of claim amounts of the Swedish motorcycle data presented in Chap. 13.2

#### **Feature Engineering**

We have 4 continuous feature components OwnerAge, RiskClass, VehAge and BonusClass, one binary feature component Gender and a categorical component Area, see Listing 13.4. We have decided on a minimal feature engineering; we refer to Figs. 13.19 (rhs) and 13.20 (rhs) for descriptive plots. We use the continuous variables directly in a log-linear fashion, we add quadratic terms for OwnerAge and VehAge, we merge RiskClass 6 and 7, and we censor VehAge at 20. Area is categorical, but we may interpret its Zone levels as ordinal categorical, and mapping them to integers allows us to use them in a continuous fashion; Fig. 13.19 (middle row, rhs) shows that this is a reasonable choice. Moreover, we merge Zone 5, 6 and 7 due to small volumes and their similar behavior.

#### **Gamma Generalized Linear Model**

The Swedish motorcycle claim amount data poses the special difficulty that we do not have individual claim observations $Z\_{i,j}$, but we only know the total claim amounts $S\_i = \sum\_{j=1}^{N\_i} Z\_{i,j}$ and the number of claims $N\_i$ on each insurance policy; Fig. 5.9 shows the average claims $S\_i/N\_i$ of insurance policies *i* with $N\_i > 0$. In general, this poses a problem in statistical modeling, but in the gamma model this problem can be handled because the gamma distribution is closed under aggregation of i.i.d. gamma claims $Z\_{i,j}$. In all that follows in this section, we only study insurance policies with $N\_i > 0$, and we label these insurance policies *i* accordingly.

Assume that *Zi,j* are i.i.d. gamma distributed with shape parameter *αi* and scale parameter *ci*, we refer to (2.6). The mean, the variance and the moment generating function of *Zi,j* are given by

$$\mathbb{E}[Z\_{l,j}] = \frac{\alpha\_l}{c\_l}, \qquad \text{Var}(Z\_{l,j}) = \frac{\alpha\_l}{c\_l^2} \qquad \text{and} \qquad M\_{Z\_{l,j}}(r) = \left(\frac{c\_l}{c\_l - r}\right)^{\alpha\_l}, \tag{5.43}$$

where the moment generating function requires $r < c\_i$ to be finite. Assuming that the number of claims $N\_i$ is a known positive integer $n\_i \in \mathbb{N}$, we see from the moment generating function that $S\_i = \sum\_{j=1}^{n\_i} Z\_{i,j}$ is again gamma distributed with shape parameter $n\_i\alpha\_i$ and scale parameter $c\_i$. We change the notation from $N\_i$ to $n\_i$ to emphasize that the number of claims is treated as a known constant (and also to avoid using the notation of conditional probabilities, here). Finally, we scale $Y\_i = S\_i/(n\_i\alpha\_i) \sim \Gamma(n\_i\alpha\_i, n\_i\alpha\_i c\_i)$. This random variable $Y\_i$ has a single-parameter EDF gamma distribution with weight $v\_i = n\_i$, dispersion $\varphi\_i = 1/\alpha\_i$ and cumulant function $\kappa(\theta\_i) = -\log(-\theta\_i)$, for $\theta\_i \in \Theta = (-\infty, 0)$,

$$Y\_l \sim f(\mathbf{y}; \theta\_l, v\_l/\varphi\_l) = \exp\left\{\frac{\mathbf{y}\theta\_l - \kappa(\theta\_l)}{\varphi\_l/v\_l} + a(\mathbf{y}; v\_l/\varphi\_l)\right\} \tag{5.44}$$

$$= \frac{(-\theta\_l \alpha\_l v\_l)^{v\_l \alpha\_l}}{\Gamma(v\_l \alpha\_l)} \mathbf{y}^{v\_l \alpha\_l - 1} \exp\left\{-(-\theta\_l \alpha\_l v\_l)\mathbf{y}\right\},$$

and the canonical parameter is $\theta\_i = -c\_i$. For our GLM analysis we treat the shape parameter $\alpha\_i \equiv \alpha > 0$ as a nuisance parameter that does not depend on the specific policy *i*, i.e., we set constant dispersion $\varphi = 1/\alpha$, and only the scale parameter $c\_i$ is chosen policy dependent through $\theta\_i = -c\_i$.
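The closure under aggregation and the rescaling used above can be illustrated with scipy; the values for *α*, *c* and *ni* below are hypothetical.

```python
from scipy import stats

# Hypothetical per-claim shape alpha, rate c, and claim count n_claims.
alpha, c, n_claims = 2.5, 0.001, 4

# A single claim Z ~ Gamma(alpha, c); the sum of n_claims i.i.d. claims
# is S ~ Gamma(n_claims * alpha, c), i.e., same scale, aggregated shape.
single = stats.gamma(a=alpha, scale=1 / c)
total = stats.gamma(a=n_claims * alpha, scale=1 / c)

assert abs(total.mean() - n_claims * single.mean()) < 1e-6
assert abs(total.var() - n_claims * single.var()) < 1e-6

# Rescaling S by 1/(n_claims*alpha) gives Y ~ Gamma(n*alpha, n*alpha*c),
# the reproductive EDF form with weight v = n and dispersion 1/alpha;
# its mean 1/c no longer depends on the claim count.
Y = stats.gamma(a=n_claims * alpha, scale=1 / (n_claims * alpha * c))
assert abs(Y.mean() - 1 / c) < 1e-6
```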

The random variable $Y\_i = S\_i/(n\_i\alpha) \sim \Gamma(n\_i\alpha, n\_i\alpha c\_i)$ gives the reproductive form of the gamma EDF, see Remarks 2.13. In applications, this form is not directly useful because under an unknown shape parameter *α*, we cannot calculate the observations $Y\_i = S\_i/(n\_i\alpha)$. For this reason, we parametrize the model differently, here. We consider instead

$$Y\_i = S\_i/n\_i \sim \Gamma(n\_i\alpha, n\_i c\_i). \tag{5.45}$$

This (new) random variable has the same gamma EDF (5.44), we only need to reinterpret the canonical parameter as *θi* = −*ci/α*. Then, we choose the log-link for *g* which implies

$$\mu\_i = \mathbb{E}\_{\theta\_i}[Y\_i] = \kappa'(\theta\_i) = -\frac{1}{\theta\_i} = \exp\{\eta\_i\} = \exp\langle \beta, x\_i \rangle,$$

if $x\_i \in \mathcal{X} \subset \mathbb{R}^{q+1}$ describes the pre-processed features of policy *i*. The gamma GLM is now fully specified and can be fitted to the data; from Example 5.5 we know that we have a concave maximization problem. We call this model Gamma GLM1 (with the feature pre-processing as described above). Note that the (constant) dispersion parameter *ϕ* cancels in the score equations; thus, we do not need to explicitly specify the nuisance parameter *α* to estimate the regression parameter $\beta \in \mathbb{R}^{q+1}$.

#### **Maximum Likelihood Estimation and Model Selection**

Because we only have few claims data in this Swedish motorcycle example (only *m* = 656 insurance policies suffer claims), we do not perform a generalization analysis with learning and test samples. In this situation we need all data for model fitting, and model performance is analyzed with AIC and with tenfold cross-validation.

The in-sample deviance loss in the gamma GLM is given by

$$\mathfrak{D}(\mathcal{L}, \widehat{\mu}(\cdot)) = \frac{2}{m} \sum\_{i=1}^{m} \frac{n\_i}{\varphi} \left( \frac{Y\_i - \widehat{\mu}(\boldsymbol{x}\_i)}{\widehat{\mu}(\boldsymbol{x}\_i)} - \log \left( \frac{Y\_i}{\widehat{\mu}(\boldsymbol{x}\_i)} \right) \right), \tag{5.46}$$

where *i* runs over the policies *i* = 1*,...,m* with positive claims $Y\_i = S\_i/n\_i > 0$, and $\widehat{\mu}(x\_i) = \exp\langle\widehat{\beta}^{\rm MLE}, x\_i\rangle$ is the MLE estimated regression function. Similar to the Poisson case (5.29), McCullagh–Nelder [265] derive the following behavior

**Fig. 5.10** (lhs) Empirical density of $Y\_i$ and (rhs) empirical density of $Y\_i^{1/3}$

for the gamma unit deviance around its mode, see Section 7.2 and Figure 7.2 in McCullagh–Nelder [265],

$$\mathfrak{d}\left(Y\_{i},\mu\_{i}\right) \approx \, 9\, Y\_{i}^{2/3} \left(Y\_{i}^{-1/3} - \mu\_{i}^{-1/3}\right)^{2},\tag{5.47}$$

this uses that the log-likelihood is symmetric around its mode for the scale $\mu\_i^{-1/3}$, see Fig. 5.5 (middle). This shows that the gamma deviance scales differently around $Y\_i$ compared to the square loss function. From this we receive an approximation to the deviance residuals (for *v/ϕ* = 1)

$$r\_l^D = \text{sign}(Y\_l - \mu\_l)\sqrt{\mathfrak{d}(Y\_l, \mu\_l)} \approx 3\left(\left(\frac{Y\_l}{\mu\_l}\right)^{1/3} - 1\right) = 3\frac{Y\_l^{1/3} - \mu\_l^{1/3}}{\mu\_l^{1/3}}.\tag{5.48}$$

This is the cube-root transformation derived by Wilson–Hilferty [383]. It suggests that if the empirical distribution of $Y\_i^{1/3}$ looks roughly Gaussian, we can use a gamma distribution. Figure 5.10 gives the empirical densities of $Y\_i$ on the left-hand side and of $Y\_i^{1/3}$ on the right-hand side. The latter looks roughly Gaussian (except for the second mode close to 4); this supports the use of a gamma model.
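The quality of the cube-root approximation (5.48) is easy to inspect numerically; here a Python sketch with illustrative *(y, μ)* pairs of our own choosing.

```python
import math

def gamma_unit_deviance(y, mu):
    """Gamma unit deviance 2*((y - mu)/mu - log(y/mu)), cf. (5.46)."""
    return 2.0 * ((y - mu) / mu - math.log(y / mu))

# Deviance residual vs. the Wilson-Hilferty cube-root approximation (5.48),
# for v/phi = 1 and moderate deviations of y from mu.
for y, mu in [(1.2, 1.0), (0.8, 1.0), (2.0, 1.7)]:
    r_dev = math.copysign(math.sqrt(gamma_unit_deviance(y, mu)), y - mu)
    r_wh = 3.0 * ((y / mu) ** (1 / 3) - 1.0)
    assert abs(r_dev - r_wh) < 0.01
```

For these moderate ratios *y/μ* the two residuals agree to roughly four decimal places, which is why the cube-root diagnostic of Fig. 5.10 is informative.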

Listing 5.11 provides the summary statistics of the fitted model Gamma GLM1; note that we integrate the number of claims *ni* through scaling into the weights. We have *q* + 1 = 9 regression parameters, and from this summary statistics we observe that not all variables should be kept in the model. If we perform backward elimination using drop1 in each step, see Sect. 5.3.3, we first drop BonusClass and then Gender, resulting in a reduced model with 7 parameters. We call this reduced model Gamma GLM2.


```
1 Call:
2 glm(formula = ClaimAmount/ClaimNb ~ OwnerAge + I(OwnerAge^2) +
3 AreaGLM + RiskClass + VehAge + I(VehAge^2) + Gender + BonusClass,
4 family = Gamma(link = "log"), data = mcdata0, weights = ClaimNb)
5
6 Deviance Residuals:
7 Min 1Q Median 3Q Max
8 -3.3683 -1.4585 -0.5979 0.4354 3.4763
9
10 Coefficients:
11                  Estimate Std. Error t value Pr(>|t|)
12 (Intercept)    8.9737854  0.5532821  16.219  < 2e-16 ***
13 OwnerAge       0.1072781  0.0280862   3.820 0.000147 ***
14 I(OwnerAge^2) -0.0014508  0.0003489  -4.158 3.65e-05 ***
15 AreaGLM       -0.0768512  0.0368284  -2.087 0.037303 *
16 RiskClass      0.0615575  0.0327553   1.879 0.060651 .
17 VehAge        -0.2051148  0.0296184  -6.925 1.05e-11 ***
18 I(VehAge^2)    0.0062649  0.0015946   3.929 9.45e-05 ***
19 GenderMale     0.1085538  0.1673443   0.649 0.516772
20 BonusClass     0.0089004  0.0225371   0.395 0.693029
21 ---
22 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
23
24 (Dispersion parameter for Gamma family taken to be 1.536577)
25
26 Null deviance: 1368.0 on 655 degrees of freedom
27 Residual deviance: 1126.5 on 647 degrees of freedom
28 AIC: 14922
29
30 Number of Fisher Scoring iterations: 11
```
**Table 5.13** Run times, number of parameters, AICs, Pearson's dispersion estimate, in-sample losses, tenfold cross-validation losses and the in-sample average claim amounts of the null model (gamma intercept model) and the gamma GLMs


The results of models Gamma GLM1 and Gamma GLM2 are presented in Table 5.13. We show AICs, Pearson's dispersion estimate, the in-sample deviance losses on all available data, the corresponding tenfold cross-validation losses, and the average claim amounts.

Firstly, we observe that the GLMs do not satisfy the balance property. This is because we do not use the canonical link; we avoid it here so that we do not have to deal with the one-sided bounded effective domain $\boldsymbol{\Theta} = (-\infty, 0)$. For pricing, the intercept parameter $\widehat{\beta}\_0^{\rm MLE}$ should be shifted to eliminate this bias, i.e., under the log-link we need to shift this parameter by $-\log(25\,130/24\,641)$ for model Gamma GLM2.
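The intercept correction described above can be sketched in a few lines. The following is a minimal pure-Python illustration (the book's own computations use R; the totals here are hypothetical toy numbers, not the Table 5.13 figures): under the log-link, shifting the intercept by $-\log(\sum \text{fitted}/\sum \text{observed})$ multiplies every fitted mean by the same factor and restores the balance property.

```python
import math

# hypothetical fitted means from a log-link GLM and the observed responses;
# the totals do not match (no balance property under the log-link)
fitted = [2.4, 3.1, 1.7, 4.0]
observed = [2.0, 3.5, 1.2, 3.9]

# shift the intercept by -log(sum(fitted)/sum(observed)); under the log-link
# this multiplies every fitted mean by sum(observed)/sum(fitted)
shift = -math.log(sum(fitted) / sum(observed))
fitted_adj = [m * math.exp(shift) for m in fitted]

print(sum(fitted_adj), sum(observed))  # totals now agree
```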

Secondly, the in-sample and tenfold cross-validation losses are not directly comparable to AIC. Observe that we need to know the dispersion parameter $\varphi$ in order to calculate these statistics. For the in-sample and cross-validation losses we have set $\varphi = 1$; thus, all these figures are directly comparable. For AIC we have estimated the dispersion parameter $\varphi$ with MLE, which is the reason for increasing the number of parameters in Table 5.13 by $+1$. Moreover, the resulting AICs differ from the ones returned by the R command glm, see, for instance, Listing 5.11. The AIC value in Listing 5.11 does not consider all terms appropriately due to the inclusion of weights (this is similar to Remark 5.22): it uses the deviance dispersion estimate $\widehat{\varphi}^{\rm D}$, i.e., not the MLE, and it (still) increases the number of parameters by 1 because the dispersion is estimated. For these reasons, we have implemented our own code for calculating AIC. Both AIC and the tenfold cross-validation losses indicate that we should give preference to model Gamma GLM2.

The dispersion estimate in Listing 5.11 corresponds to Pearson's estimate

$$\widehat{\varphi}^{\mathrm{P}} = \frac{1}{m - (q + 1)} \sum\_{i=1}^{m} n\_i \frac{(Y\_i - \widehat{\mu}\_i)^2}{\widehat{\mu}\_i^2}. \tag{5.49}$$

We observe that the dispersion estimate is roughly 1.5, which gives an estimate of the shape parameter $\alpha = 1/\varphi$ of roughly $2/3$. A shape parameter less than 1 implies that the density of the gamma distribution is strictly decreasing, see Fig. 2.1. Often this is a sign that the model does not fully fit the data, and if we use this model for simulation we may receive too many observations close to zero compared to the true data. A shape parameter less than 1 may be caused by more heterogeneity in the data than the chosen gamma GLM allows for, or by large claims that cannot be explained by the present gamma density structure. Thus, there is some indication here that the data is more heavy-tailed than our model choice suggests. Alternatively, there might be some need to also model the shape parameter with a regression model; this could be done using the vector-valued parameter EF representation of the gamma model, see Sect. 2.1.3. In view of Fig. 5.10 (rhs) it may also be that the feature information is not sufficient to describe the second mode at 4; thus, we probably need more explanatory information to reduce dispersion.
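Pearson's dispersion estimate (5.49) is a one-line weighted sum. The following pure-Python sketch computes it on a hypothetical toy portfolio (the book's analysis uses R on the Swedish motorcycle data; all numbers below are made up for illustration):

```python
# Pearson's dispersion estimate (5.49) on a toy portfolio: m observations,
# q+1 regression parameters, claim counts n_i entering as weights
Y = [1.0, 2.0, 3.0, 4.0]      # average claim sizes Y_i = S_i / n_i (toy values)
mu = [2.0, 2.0, 2.0, 2.0]     # fitted means (hypothetical)
n = [1, 1, 1, 1]              # claim counts n_i
q_plus_1 = 1                  # number of regression parameters

m = len(Y)
phi_P = sum(n_i * (y - m_i) ** 2 / m_i ** 2
            for n_i, y, m_i in zip(n, Y, mu)) / (m - q_plus_1)
alpha_hat = 1.0 / phi_P       # implied gamma shape parameter estimate

print(phi_P, alpha_hat)
```

A shape estimate `alpha_hat` below 1 would flag the strictly decreasing gamma density discussed above.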

In Fig. 5.11 we give the Tukey–Anscombe plot and a QQ plot. Note that the observations for $n\_i = 1$ follow a gamma distribution with shape parameter $\alpha$ and scale parameter $c\_i = \alpha/\mu\_i = -\alpha\theta\_i$. Thus, if we scale $Y\_i/\mu\_i$, we receive i.i.d. gamma random variables with shape and scale parameters both equal to $\alpha$. For $n\_i = 1$ this allows us to plot the empirical distribution of $Y\_i/\widehat{\mu}\_i$ against $\Gamma(\alpha, \alpha)$ in a QQ plot, where we estimate $1/\alpha$ by Pearson's dispersion estimate. The Tukey–Anscombe plot looks reasonable, but the QQ plot shows that the gamma model does not entirely fit the data. From this plot we cannot conclude whether the gamma distribution itself is causing the problem or whether a term is missing in the regression structure. We only see that the data is over-dispersed, resulting in more heavy-tailed observations than the theoretical gamma model can explain, and in a compensation by too many small observations (which is induced by over-dispersion, i.e., a shape parameter smaller than 1). In the network chapter we will refine the regression function, keeping the gamma assumption, to understand which modeling part is causing the difficulty.

*Remark 5.26* For the calculation of AIC in Table 5.13 we have used the MLE of the dispersion parameter *ϕ*. This is obtained by solving the score equation (5.11) for the

**Fig. 5.11** (lhs) Tukey–Anscombe plot of the fitted model Gamma GLM2, and (rhs) QQ plot of the fitted model Gamma GLM2

gamma case. Setting $\alpha = 1/\varphi$, we calculate the MLE of $\alpha$ instead; the score equation reads

$$\frac{\partial}{\partial \alpha} \ell\_{Y}(\boldsymbol{\beta}, \alpha) = \sum\_{i=1}^{n} v\_i \left[ Y\_i h(\mu(\boldsymbol{x}\_i)) - \kappa \left( h(\mu(\boldsymbol{x}\_i)) \right) + \log Y\_i + \log(\alpha v\_i) + 1 - \Psi(\alpha v\_i) \right] = 0,$$

where $\Psi(\alpha) = \Gamma'(\alpha)/\Gamma(\alpha)$ is the digamma function. We calculate the second derivative w.r.t. $\alpha$, see also (2.30),

$$\frac{\partial^2}{\partial \alpha^2} \ell\_{Y}(\boldsymbol{\beta}, \alpha) = \sum\_{i=1}^n v\_i \left[ \frac{1}{\alpha} - v\_i \Psi'(\alpha v\_i) \right] = \sum\_{i=1}^n v\_i^2 \left[ \frac{1}{\alpha v\_i} - \Psi'(\alpha v\_i) \right] < 0 \qquad \text{for } \alpha > 0,$$

the negativity follows from Theorem 1 in Alzer [9]. In fact, the function $\log \alpha - \Psi(\alpha)$ is strictly completely monotonic for $\alpha > 0$. This says that the log-likelihood $\ell\_Y(\boldsymbol{\beta}, \alpha)$ is a concave function in $\alpha > 0$ and the solution to the score equation is unique, giving the MLE of $\alpha$ and $\varphi$, respectively.
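Because the score is strictly decreasing in $\alpha$, its unique root can be found by simple bracketing. The following pure-Python sketch solves the score equation in the special case of unit weights $v\_i = 1$ and known means (the book uses R; the digamma function is approximated here by a central difference of `math.lgamma`, and the simulated data are purely illustrative):

```python
import math, random

def digamma(a, h=1e-5):
    # numerical digamma via a central difference of log-Gamma (sketch accuracy)
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2.0 * h)

def score(alpha, Y, mu):
    # gamma score in alpha for unit weights v_i = 1; for the gamma model
    # h(mu) = -1/mu and kappa(h(mu)) = log(mu)
    const = math.log(alpha) + 1.0 - digamma(alpha)
    return sum(math.log(y) - y / m - math.log(m) + const for y, m in zip(Y, mu))

random.seed(1)
alpha_true, mu_true = 2.0, 5.0
Y = [random.gammavariate(alpha_true, mu_true / alpha_true) for _ in range(20000)]
mu = [mu_true] * len(Y)

# the score is strictly decreasing in alpha (concavity of the log-likelihood),
# so a simple bisection on a bracketing interval finds the unique root
lo, hi = 0.01, 100.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if score(mid, Y, mu) > 0.0 else (lo, mid)
alpha_mle = 0.5 * (lo + hi)
print(alpha_mle)  # close to alpha_true = 2
```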

## *5.3.8 Lab: Inverse Gaussian GLM for Claim Sizes*

We present the inverse Gaussian GLM in this section as a competing model to the gamma GLM studied in the previous section.

#### **Infinite Divisibility**

In the gamma model above we have used that the total claim amount $S = \sum\_{j=1}^{n} Z\_j$ has a gamma distribution for a given claim count $N = n > 0$ and i.i.d. gamma claim sizes $Z\_j$. This property is closely related to divisibility. A random variable $S$ is called divisible by $n \in \mathbb{N}$ if there exist i.i.d. random variables $Z\_1, \ldots, Z\_n$ such that

$$\mathbf{S} \stackrel{(\mathbf{d})}{=} \sum\_{j=1}^{n} \mathbf{Z}\_{j},$$

and $S$ is called *infinitely divisible* if $S$ is divisible by $n$ for all $n \in \mathbb{N}$. The EDF is based on parameters $(\theta, \omega) \in \boldsymbol{\Theta} \times \mathcal{W}$. Jørgensen [203] gives the following interesting result.

**Theorem 5.27 (Theorem 3.7 in Jørgensen [203], Without Proof)** *Choose a member of the EDF with parameter set* $\boldsymbol{\Theta} \times \mathcal{W}$*. Then*


This theorem tells us how to aggregate and disaggregate within EDFs: e.g., the Poisson, gamma and inverse Gaussian models are infinitely divisible, and the binomial distribution is divisible by $n$, with the disaggregated random variables belonging to the same EDF and having the same canonical parameter, see Sect. 2.2.2. In particular, we also refer to Corollary 2.15 on the convolution property.

#### **Inverse Gaussian Generalized Linear Model**

Alternatively to the gamma GLM, one often explores an inverse Gaussian GLM, which has a cubic variance function $V(\mu) = \mu^3$. We bring this inverse Gaussian model into the same form as the gamma model of Sect. 5.3.7, so that we can aggregate claims within insurance policies. The mean, the variance and the moment generating function of an inverse Gaussian random variable $Z\_{i,j}$ with parameters $\alpha\_i, c\_i > 0$ are given by

$$\mathbb{E}[Z\_{i,j}] = \frac{\alpha\_i}{c\_i}, \quad \text{Var}(Z\_{i,j}) = \frac{\alpha\_i}{c\_i^3} \quad \text{and} \quad M\_{Z\_{i,j}}(r) = \exp\left\{\alpha\_i \left[c\_i - \sqrt{c\_i^2 - 2r}\right] \right\},$$

where the moment generating function requires $r < c\_i^2/2$ to be finite. From the moment generating function we see that $S\_i = \sum\_{j=1}^{n\_i} Z\_{i,j}$ is inverse Gaussian distributed with parameters $n\_i \alpha\_i$ and $c\_i$. Finally, we scale $Y\_i = S\_i/(n\_i \alpha\_i)$, which provides us with an inverse Gaussian distribution with parameters $n\_i^{1/2} \alpha\_i^{1/2}$ and $n\_i^{1/2} \alpha\_i^{1/2} c\_i$. This random variable $Y\_i$ has a single-parameter EDF inverse Gaussian distribution in its reproductive form, namely,
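Both the aggregation property and the scaling property can be verified numerically from the moment generating function alone. The following pure-Python sketch checks the two identities at an arbitrary admissible argument $r$ (the parameter values are hypothetical):

```python
import math

def mgf_ig(r, alpha, c):
    # MGF of the inverse Gaussian with parameters alpha, c (finite for r < c^2/2)
    return math.exp(alpha * (c - math.sqrt(c * c - 2.0 * r)))

alpha, c, n = 1.5, 2.0, 7      # hypothetical parameters and number of claims
s = math.sqrt(n * alpha)
r = 0.3                        # admissible: r < c^2/2

# aggregation: the sum of n i.i.d. InvGauss(alpha, c) is InvGauss(n*alpha, c)
agg_lhs = mgf_ig(r, alpha, c) ** n
agg_rhs = mgf_ig(r, n * alpha, c)

# scaling: Y = S/(n*alpha) is InvGauss((n*alpha)^{1/2}, (n*alpha)^{1/2} c)
scale_lhs = mgf_ig(r / (n * alpha), n * alpha, c)
scale_rhs = mgf_ig(r, s, s * c)

print(agg_lhs, agg_rhs, scale_lhs, scale_rhs)
```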

$$Y\_i \sim f(y; \theta\_i, v\_i/\varphi\_i) = \exp\left\{\frac{y\theta\_i - \kappa(\theta\_i)}{\varphi\_i/v\_i} + a(y; v\_i/\varphi\_i)\right\} \tag{5.50}$$

$$= \sqrt{\frac{\alpha\_i v\_i}{2\pi y^3}} \exp\left\{-\frac{\alpha\_i v\_i}{2 y}\left(1 - \sqrt{-2\theta\_i}\, y\right)^2\right\},$$

with cumulant function $\kappa(\theta) = -\sqrt{-2\theta}$ for $\theta \in \boldsymbol{\Theta} = (-\infty, 0]$, weight $v\_i = n\_i$, dispersion parameter $\varphi\_i = 1/\alpha\_i$ and canonical parameter $\theta\_i = -c\_i^2/2$.

Similarly to the gamma case, this representation is not directly useful if the parameter *αi* is not known. Therefore, we parametrize this model differently. Namely, we consider

$$Y\_i = S\_i/n\_i \; \sim \; \text{InvGauss}\left(n\_i^{1/2}\alpha\_i,\, n\_i^{1/2}c\_i\right). \tag{5.51}$$

This re-scaled random variable has the same inverse Gaussian EDF form (5.50), but we need to re-interpret the parameters. We have dispersion parameter $\varphi\_i = 1/\alpha\_i^2$ and canonical parameter $\theta\_i = -c\_i^2/(2\alpha\_i^2)$. For our GLM analysis we will treat the parameter $\alpha\_i \equiv \alpha > 0$ as a nuisance parameter that does not depend on the specific policy $i$. Thus, we have constant dispersion $\varphi = 1/\alpha^2$, and only the scale parameter $c\_i$ is assumed to be policy dependent through the canonical parameter $\theta\_i = -c\_i^2/(2\alpha^2)$.

We are now in the same situation as in the gamma case in Sect. 5.3.7. We choose the log-link for *g* which implies

$$\mu\_i = \mathbb{E}\_{\theta\_i}[Y\_i] = \kappa'(\theta\_i) = \frac{1}{\sqrt{-2\theta\_i}} = \exp\{\eta\_i\} = \exp\langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle,$$

for $\boldsymbol{x}\_i \in \mathcal{X} \subset \mathbb{R}^{q+1}$ describing the pre-processed features of policy $i$. We use the same feature pre-processing as in model Gamma GLM2, and we call the resulting model IG GLM2. Again, the constant dispersion parameter $\varphi = 1/\alpha^2$ cancels in the score equations; thus, we do not need to explicitly specify the nuisance parameter $\alpha$ to estimate the regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$. However, there is an important difference to the gamma GLM: as stated in Example 5.6, we do not have a concave maximization problem, and Fisher's scoring method needs a suitable initial value. We start the fitting algorithm in the parameters of model Gamma GLM2.

The in-sample deviance loss in the inverse Gaussian GLM is given by

$$\mathfrak{D}(\mathcal{L}, \widehat{\mu}(\cdot)) = \frac{1}{m} \sum\_{i=1}^{m} \frac{n\_i}{\varphi} \frac{(Y\_i - \widehat{\mu}(\mathbf{x}\_i))^2}{\widehat{\mu}(\mathbf{x}\_i)^2 \ Y\_i},\tag{5.52}$$

where $i$ runs over the policies $i = 1, \ldots, m$ with positive claims $Y\_i = S\_i/n\_i > 0$, and $\widehat{\mu}(\boldsymbol{x}\_i) = \exp\langle \widehat{\boldsymbol{\beta}}^{\rm MLE}, \boldsymbol{x}\_i\rangle$ is the MLE estimated regression function. The unit deviances behave as

$$\mathfrak{d}\left(Y\_i, \mu\_i\right) = Y\_i\left(Y\_i^{-1} - \mu\_i^{-1}\right)^2, \tag{5.53}$$

**Table 5.14** Run times, number of parameters, AICs, in-sample losses, tenfold cross-validation losses and the in-sample average claim amounts of the null gamma model, model Gamma GLM2, the null inverse Gaussian model, and model inverse Gaussian GLM2; the deviance losses use unit dispersion *ϕ* = 1


note that the log-likelihood is symmetric around its mode for the scale $\mu\_i^{-1}$, see Fig. 5.5 (rhs). From this we receive the deviance residuals (for $v/\varphi = 1$)

$$r\_i^{\mathrm{D}} = \text{sign}(Y\_i - \mu\_i)\sqrt{\mathfrak{d}(Y\_i, \mu\_i)} = Y\_i^{1/2} \left(\mu\_i^{-1} - Y\_i^{-1}\right).$$

Thus, these residuals behave as $Y\_i^{1/2}$ for $Y\_i \to \infty$ (and fixed $\mu\_i^{-1}$), which is more heavy-tailed than the cube-root behavior $Y\_i^{1/3}$ in the gamma case, see (5.48). Another difference to the gamma case is that the deviance loss (5.52) is not scale-invariant, see also (11.4), below.
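The closed form of the inverse Gaussian deviance residual can be checked numerically against the generic definition $\text{sign}(Y-\mu)\sqrt{\mathfrak{d}(Y,\mu)}$. A minimal pure-Python sketch (toy values, not the Swedish motorcycle data):

```python
import math

def unit_dev_ig(y, mu):
    # inverse Gaussian unit deviance (5.53): d(y, mu) = y (1/y - 1/mu)^2
    return y * (1.0 / y - 1.0 / mu) ** 2

def dev_residual(y, mu):
    # generic deviance residual: sign(y - mu) * sqrt(d(y, mu))
    return math.copysign(math.sqrt(unit_dev_ig(y, mu)), y - mu)

for y, mu in [(0.5, 2.0), (2.0, 2.0), (8.0, 2.0)]:
    closed_form = math.sqrt(y) * (1.0 / mu - 1.0 / y)
    assert abs(dev_residual(y, mu) - closed_form) < 1e-12

# heavy-tail behaviour: the residual grows like sqrt(y) for fixed mu
print(dev_residual(100.0, 2.0))
```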

We revisit the example of Table 5.13, but we replace the gamma distribution by the inverse Gaussian distribution. The results in Table 5.14 show that the inverse Gaussian model is not fully competitive on this data set. In view of (5.43) we observe that the coefficient of variation (standard deviation divided by mean) is given by $1/\sqrt{\alpha}$ in the gamma model; thus, in the gamma model this coefficient of variation is independent of the expected claim size $\mu\_i$ and only depends on the shape parameter $\alpha$. In the inverse Gaussian model the coefficient of variation is given by

$$\text{Vco}(Z\_{i,j}) = \frac{\sqrt{\text{Var}(Z\_{i,j})}}{\text{E}[Z\_{i,j}]} = \frac{\sqrt{\mu\_i}}{\alpha},$$

thus, it monotonically increases in the expected claim size *μi*. It seems that this structure is not fully suitable for this data set, i.e., there is no indication that the coefficient of variation increases in the expected claim size. We come back to a comparison of the gamma and the inverse Gaussian model in Sect. 11.1, below.

## *5.3.9 Log-Normal Model for Claim Sizes: A Short Discussion*

Another way to improve the gamma model of Sect. 5.3.7 could be to use a log-normal distribution instead. In the above situation this does not work because the observations are not in the right format. If the claim observations $Z\_{i,j}$ are log-normally distributed, then $\log(Z\_{i,j})$ are normally distributed. Unfortunately, in our Swedish motorcycle data set we do not have individual claim observations $Z\_{i,j}$; the provided information is aggregated over all claims per insurance policy, i.e., $S\_i = \sum\_{j=1}^{N\_i} Z\_{i,j}$. Therefore, there is no possibility here to challenge the gamma framework of Sect. 5.3.7 with a corresponding log-normal framework, because the log-normal family is not closed under summation of i.i.d. log-normally distributed random variables.

We would like to give some remarks concerning calculations on the log-scale (or any other strictly increasing and concave transformation of the original data). For the log-normal distribution, as well as in similar cases like the log-gamma distribution, one works with the logged observations $Y\_i = \log(Z\_i)$. This is a strictly monotone transformation, and the MLEs in the log-normal model based on observations $Z\_i$ and in the normal model based on observations $Y\_i = \log(Z\_i)$ coincide. This can be seen from the following calculation. We start from the log-normal density on $\mathbb{R}\_+$, and we apply the change of variable $z > 0 \mapsto y = \log(z) \in \mathbb{R}$ with $dy = dz/z$

$$\begin{split} f\_{\mathrm{LN}}(z;\,\mu,\,\sigma^{2})dz &= \frac{1}{\sqrt{2\pi\sigma^{2}}} \frac{1}{z} \exp\left\{-\frac{1}{2\sigma^{2}} \left(\log(z) - \mu\right)^{2}\right\} dz \\ &= \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left\{-\frac{1}{2\sigma^{2}} \left(\mathbf{y} - \mu\right)^{2}\right\} d\mathbf{y} = f\_{\Phi}(\mathbf{y};\,\mu,\,\sigma^{2})d\mathbf{y}. \end{split}$$

From this we see that the MLEs will coincide.

In many situations, one assumes that $\sigma^2 > 0$ is a given nuisance parameter, and one models $\boldsymbol{x} \mapsto \mu(\boldsymbol{x})$ with a GLM within the single-parameter EDF. In the log-normal/Gaussian case one typically chooses the canonical link on the log-scale, which is the identity function. This allows one to perform a classical linear regression for $\mu(\boldsymbol{x}) = \langle \boldsymbol{\beta}, \boldsymbol{x}\rangle$ using the logged observations $\boldsymbol{Y} = (Y\_1, \ldots, Y\_n)^\top = (\log(Z\_1), \ldots, \log(Z\_n))^\top$, and the corresponding MLE is given by

$$
\widehat{\boldsymbol{\beta}}^{\text{MLE}} = (\mathfrak{X}^{\top}\mathfrak{X})^{-1}\mathfrak{X}^{\top}\boldsymbol{Y}, \tag{5.54}
$$

for a design matrix $\mathfrak{X}$ of full rank $q + 1 \le n$. Note that in this case we have a closed-form solution for the MLE of $\boldsymbol{\beta}$. This is called the homoskedastic case because all observations $Y\_i$ are assumed to have the same variance $\sigma^2$; otherwise, in the heteroskedastic case, we would still have to include the covariance matrix.
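The closed-form solution (5.54) amounts to solving the normal equations $\mathfrak{X}^\top\mathfrak{X}\boldsymbol{\beta} = \mathfrak{X}^\top\boldsymbol{Y}$. A minimal pure-Python sketch for a toy design with an intercept and one feature (hypothetical data chosen so the fit is exact):

```python
# closed-form MLE (5.54) on the log-scale: solve the normal equations
# X^T X beta = X^T Y for a toy design with intercept and one feature
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [1.0, 3.0, 5.0, 7.0]       # logged observations Y_i = log(Z_i), toy values

# build the 2x2 matrix X^T X and the vector X^T Y
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
d = sum(r[1] * r[1] for r in X)
u = sum(r[0] * y for r, y in zip(X, Y))
v = sum(r[1] * y for r, y in zip(X, Y))

det = a * d - b * b            # full rank <=> det != 0
beta0 = (d * u - b * v) / det
beta1 = (a * v - b * u) / det
print(beta0, beta1)            # the toy data lie exactly on Y = 1 + 2 x
```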

Since we work with the canonical link on the log-scale, we have the balance property on the log-scale, see Corollary 5.7. Thus, we receive the unbiasedness

$$\sum\_{i=1}^{n} \mathbb{E}\_{\boldsymbol{\beta}} \left[ \widehat{\mu}(\boldsymbol{x}\_i) \right] = \sum\_{i=1}^{n} \mathbb{E}\_{\boldsymbol{\beta}} \left[ \langle \widehat{\boldsymbol{\beta}}^{\text{MLE}}, \boldsymbol{x}\_i \rangle \right] = \sum\_{i=1}^{n} \mathbb{E}\_{\boldsymbol{\beta}} \left[ Y\_i \right] = \sum\_{i=1}^{n} \mu(\boldsymbol{x}\_i). \tag{5.55}$$

**Fig. 5.12** (lhs) Tukey–Anscombe plot of the fitted Gaussian model $\widehat{\mu}(\boldsymbol{x}\_i)$ on the logged claim sizes $Y\_i = \log(Z\_i)$, and (rhs) estimated means $\widehat{\mu}\_{Z\_i}$ as a function of $\widehat{\mu}(\boldsymbol{x}\_i)$ considering heteroskedasticity $\widehat{\sigma}(\boldsymbol{x}\_i)$

If we move back to the original scale of the observations *Zi* we receive from the log-normal assumption

$$\mathbb{E}\_{(\widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}, \sigma^2)}[Z\_i] = \exp\left\{ \langle \widehat{\boldsymbol{\beta}}^{\mathrm{MLE}}, \boldsymbol{x}\_i \rangle + \sigma^2/2 \right\}.$$

Therefore, we need to adjust with the nuisance parameter $\sigma^2$ for the back-transformation to the original observation scale. At this point, typically, the difficulties start. Often, a good back-transformation involves a feature-dependent variance parameter $\sigma^2(\boldsymbol{x}\_i)$; thus, in many practical applications the homoskedasticity assumption is not fulfilled, and a constant variance parameter choice leads to a poor model on the original observation scale.

A suitable estimation of $\sigma^2(\boldsymbol{x}\_i)$ may turn out to be rather difficult. This is illustrated in Fig. 5.12. The left-hand side of this figure shows the Tukey–Anscombe plot of the homoskedastic case providing unscaled ($\sigma^2 \equiv 1$) (Pearson's) residuals on the log-scale

$$r\_i^{\mathrm{P}} = \log(Z\_i) - \widehat{\mu}(\boldsymbol{x}\_i) = Y\_i - \widehat{\mu}(\boldsymbol{x}\_i).$$

The light-blue color shows an insurance policy dependent standard deviation estimate $\widehat{\sigma}(\boldsymbol{x}\_i)$. In our case this estimate is non-monotone in $\widehat{\mu}(\boldsymbol{x}\_i)$ (which is quite common on real data). Using this estimate we can estimate the means of the log-normal random variables by

$$
\widehat{\mu}\_{Z\_l} = \widehat{\mathbb{E}}[Z\_l] = \exp\left\{ \widehat{\mu}(\mathfrak{x}\_l) + \widehat{\sigma}(\mathfrak{x}\_l)^2 / 2 \right\}.
$$

The right-hand side of Fig. 5.12 plots these estimated means $\widehat{\mu}\_{Z\_i}$ against the estimated means $\widehat{\mu}(\boldsymbol{x}\_i)$ on the log-scale. We observe a non-monotone graph, implied by the non-monotonicity of the standard deviation estimate $\widehat{\sigma}(\boldsymbol{x}\_i)$ as a function of $\widehat{\mu}(\boldsymbol{x}\_i)$. This non-monotonicity is not bad per se, as we still have a proper statistical model; however, it might be rather counter-intuitive and difficult to explain. For this reason it is advisable to directly model the expected value by one single function, and not to decompose it into different regression functions.

Another important point to be considered is that for model selection using AIC we have to work on the same scale for all models. Thus, if we use a gamma model to model *Zi*, then for an AIC selection we need to evaluate also the log-normal model on that scale. This can be seen from the justification in Sect. 4.2.3.

Finally, we focus on unbiasedness. Note that on the log-scale we have unbiasedness (5.55) through the balance property. Unfortunately, this does not carry over to the original scale. We give a small example, where we assume that there is neither any uncertainty about the distributional model nor about the nuisance parameter. That is, we assume that the $Z\_i$ are i.i.d. log-normally distributed with parameters $\mu$ and $\sigma^2$, where only $\mu$ is unknown. The MLE of $\mu$ is given by

$$
\widehat{\mu}^{\text{MLE}} = \frac{1}{n} \sum\_{i=1}^{n} \log(Z\_i) \sim \mathcal{N}(\mu, \sigma^2/n).
$$

In this case we have

$$\begin{split} \frac{1}{n} \sum\_{i=1}^{n} \mathbb{E}\_{(\mu, \sigma^{2})} \left[ \mathbb{E}\_{(\widehat{\mu}^{\text{MLE}}, \sigma^{2})} [Z\_{i}] \right] &= \frac{1}{n} \sum\_{i=1}^{n} \mathbb{E}\_{(\mu, \sigma^{2})} \left[ \exp \{ \widehat{\mu}^{\text{MLE}} \} \right] \exp \{ \sigma^{2}/2 \} \\ &= \exp \left\{ \mu + (1 + n^{-1}) \sigma^{2}/2 \right\} \\ &> \exp \left\{ \mu + \sigma^{2}/2 \right\} = \frac{1}{n} \sum\_{i=1}^{n} \mathbb{E}\_{(\mu, \sigma^{2})} \left[ Z\_{i} \right]. \end{split}$$

Volatility in the parameter estimate $\widehat{\mu}^{\rm MLE}$ leads to a positive bias in this case. Note that we have assumed full knowledge of the distributional model (i.i.d. log-normal) and of the nuisance parameter $\sigma^2$ in this calculation. If, for instance, we do not know the true nuisance parameter and we work with a (deterministic) $\widetilde{\sigma}^2 \ll \sigma^2$ and $n > 1$, we can get a negative bias

$$\begin{split} \frac{1}{n} \sum\_{i=1}^{n} \mathbb{E}\_{(\mu, \sigma^{2})} \left[ \mathbb{E}\_{(\widehat{\mu}^{\text{MLE}}, \widetilde{\sigma}^{2})} [Z\_{i}] \right] &= \frac{1}{n} \sum\_{i=1}^{n} \mathbb{E}\_{(\mu, \sigma^{2})} \left[ \exp \{ \widehat{\mu}^{\text{MLE}} \} \right] \exp \{ \widetilde{\sigma}^{2} / 2 \} \\ &= \exp \left\{ \mu + \sigma^{2} / (2n) + \widetilde{\sigma}^{2} / 2 \right\} \\ &< \exp \left\{ \mu + \sigma^{2} / 2 \right\} = \frac{1}{n} \sum\_{i=1}^{n} \mathbb{E}\_{(\mu, \sigma^{2})} \left[ Z\_{i} \right]. \end{split}$$

This shows that working on the log-scale is rather difficult because the back-transformation is far from trivial, and for an unknown nuisance parameter not even the sign of the bias is clear. Similar considerations apply to the frequently used Box–Cox transformation [48] for $\chi \neq 1$

$$Z\_i \mapsto \ Y\_i = \frac{Z\_i^{\chi} - 1}{\chi}.$$

For this reason, if unbiasedness is a central requirement (like in insurance pricing) non-linear transformations should only be used with great care (and only if necessary).
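The positive bias derived above can be quantified directly from the closed-form expressions. A minimal pure-Python illustration of the i.i.d. log-normal example (known $\sigma^2$, unknown $\mu$; the parameter values are hypothetical):

```python
import math

# assumed setting: Z_i i.i.d. log-normal with known sigma^2 and unknown mu,
# so that the MLE satisfies mu_hat ~ N(mu, sigma^2/n)
mu, sigma2, n = 0.0, 1.0, 10

true_mean = math.exp(mu + sigma2 / 2.0)
# E[exp(mu_hat)] * exp(sigma^2/2) = exp(mu + (1 + 1/n) * sigma^2 / 2)
plug_in_mean = math.exp(mu + (1.0 + 1.0 / n) * sigma2 / 2.0)

bias_factor = plug_in_mean / true_mean   # equals exp(sigma^2/(2n)) > 1
print(bias_factor)
```

For $n = 10$ and $\sigma^2 = 1$ the plug-in estimator overstates the true mean by the factor $\exp(\sigma^2/(2n)) \approx 1.05$, i.e., roughly 5%; the bias vanishes only as $n \to \infty$.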

## **5.4 Quasi-Likelihoods**

Above we have mentioned the notion of over-dispersed Poisson models. This naturally leads to so-called quasi-Poisson models and quasi-likelihoods. The framework of quasi-likelihoods was introduced by Wedderburn [376]. In this section we give the main idea behind quasi-likelihoods; for a more detailed treatment and mathematical results we refer to Chapter 8 of McCullagh–Nelder [265].

In Sect. 5.1.4 we have discussed the estimation of GLMs. This has been based on the explicit knowledge of the full log-likelihood function $\ell\_Y(\boldsymbol{\beta})$ for given data $\boldsymbol{Y}$. This has allowed us to calculate the score equations $s(\boldsymbol{\beta}, \boldsymbol{Y}) = \nabla\_{\boldsymbol{\beta}}\, \ell\_Y(\boldsymbol{\beta}) = 0$ whose solutions (Z-estimators) contain the MLE for $\boldsymbol{\beta}$. Solving the score equations, e.g., by Fisher's scoring method, no longer needs the explicit functional form of the log-likelihood, but is only based on the first and second moments, see (5.9) and Remarks 5.4. Thus, all models in which these first two moments coincide will provide the same MLE for the regression parameter $\boldsymbol{\beta}$; this is also the explanation behind the IRLS algorithm. Moreover, the first two moments are sufficient for prediction and uncertainty quantification based on mean squared errors, and they are also sufficient to quantify asymptotic normality. This is exactly what motivates the quasi-likelihood considerations, and these considerations are also related to the quasi-generalized pseudo maximum likelihood estimator (QPMLE) that we are going to discuss in Theorem 11.8, below.

Assume that $\boldsymbol{Y}$ is a random vector having first moment $\boldsymbol{\mu} \in \mathbb{R}^n$, positive definite variance function $V(\boldsymbol{\mu}) \in \mathbb{R}^{n \times n}$ and dispersion parameter $\varphi$. The quasi-(log-)likelihood function $\ell\_Y(\boldsymbol{\mu})$ assumes that its gradient is given by

$$\nabla\_{\mu}\ell\_Y(\mu) = \frac{1}{\varphi}V(\mu)^{-1}\left(Y - \mu\right).$$

In the case of a diagonal variance function $V(\boldsymbol{\mu})$ this relates to the score (5.9). The remaining step is to model the mean parameter $\boldsymbol{\mu} = \boldsymbol{\mu}(\boldsymbol{\beta}) \in \mathbb{R}^n$ as a function of a lower-dimensional regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$; we also refer to Fig. 5.2. For this last step we assume that the Jacobian $B = d\boldsymbol{\mu}/d\boldsymbol{\beta} \in \mathbb{R}^{n \times (q+1)}$ has full rank $q + 1$. The score equations for $\boldsymbol{\beta}$ and given observations $\boldsymbol{Y}$ then read as

$$\frac{1}{\varphi} B^\top V(\boldsymbol{\mu}(\boldsymbol{\beta}))^{-1} \left(\boldsymbol{Y} - \boldsymbol{\mu}(\boldsymbol{\beta})\right) = 0.$$

This is of exactly the same structure as the score equations in Proposition 5.1, and the roots are found by using the IRLS algorithm for *t* ≥ 0, see (5.12),

$$
\widehat{\boldsymbol{\beta}}^{(t)} \mapsto \widehat{\boldsymbol{\beta}}^{(t+1)} = \left(B^{\top} V(\widehat{\boldsymbol{\mu}}^{(t)})^{-1} B\right)^{-1} B^{\top} V(\widehat{\boldsymbol{\mu}}^{(t)})^{-1} \left(B \widehat{\boldsymbol{\beta}}^{(t)} + \boldsymbol{Y} - \widehat{\boldsymbol{\mu}}^{(t)}\right),
$$

where $\widehat{\boldsymbol{\mu}}^{(t)} = \boldsymbol{\mu}(\widehat{\boldsymbol{\beta}}^{(t)})$.
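This IRLS recursion can be implemented in a few lines once a concrete variance function is chosen. The following pure-Python sketch uses the quasi-Poisson choice $V(\boldsymbol{\mu}) = \text{diag}(\mu\_1, \ldots, \mu\_n)$ with log-link on hypothetical toy data (the book's computations use R); for the log-link, $B$ has rows $\mu\_i (1, x\_i)$, and the dispersion $\varphi$ cancels in the update:

```python
import math

# toy data: intercept and one feature, log-link mean mu_i = exp(b0 + b1 * x_i)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [1.0, 2.0, 2.0, 5.0, 9.0]

b0, b1 = math.log(sum(Y) / len(Y)), 0.0       # initial value
for _ in range(50):                           # IRLS iterations
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    # with B having rows mu_i * (1, x_i) and V = diag(mu_i), the matrix
    # B^T V^{-1} B and the working response z reduce to weighted sums
    z = [(b0 + b1 * xi) + (y - m) / m for xi, y, m in zip(x, Y, mu)]
    a11 = sum(mu)
    a12 = sum(m * xi for m, xi in zip(mu, x))
    a22 = sum(m * xi * xi for m, xi in zip(mu, x))
    r1 = sum(m * zi for m, zi in zip(mu, z))
    r2 = sum(m * xi * zi for m, xi, zi in zip(mu, x, z))
    det = a11 * a22 - a12 * a12
    b0 = (a22 * r1 - a12 * r2) / det
    b1 = (a11 * r2 - a12 * r1) / det

mu = [math.exp(b0 + b1 * xi) for xi in x]
print(sum(mu), sum(Y))   # balance: fitted total equals observed total
```

Since the log-link is canonical for the (quasi-)Poisson choice, the converged solution satisfies the balance property, which the final line illustrates.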

We conclude with the following points about quasi-likelihoods:


$$V(\mu) = \text{diag}(V(\mu\_1), \dots, V(\mu\_n)),$$

then the choice of the variance function $\boldsymbol{\mu} \mapsto V(\boldsymbol{\mu})$ describes the explicit selection of the quasi-likelihood model. If we choose the power variance function $V(\mu) = \mu^p$, $p \in (0, 1)$, we have a quasi-Tweedie model.


## **5.5 Double Generalized Linear Model**

In the derivations above we have treated the dispersion parameter $\varphi$ in the GLM as a nuisance parameter. In the case of a homogeneous dispersion parameter, it cancels in the score equations for MLE, see (5.9). Therefore, it does not influence MLE, and in a subsequent step this nuisance parameter can still be estimated using, e.g., Pearson's or deviance residuals, see Sect. 5.3.1 and Remark 5.26. In some examples we may have systematic effects in the dispersion parameter, too. In this case the above approach will not work because a heterogeneous dispersion parameter no longer cancels in the score equations. This has been considered in Smyth [341] and Smyth–Verbyla [343]. The heterogeneous dispersion situation is of general interest for GLMs, and it is of particular interest for Tweedie's CP GLM if we interpret Tweedie's distribution [358] as a CP model with i.i.d. gamma claim sizes, see Proposition 2.17; we also refer to Jørgensen–de Souza [204], Smyth–Jørgensen [342] and Delong et al. [94].

## *5.5.1 The Dispersion Submodel*

We extend the model assumption (5.1) by assuming that the dispersion parameter $\varphi\_i$ is also policy $i$ dependent. Assume that all random variables $Y\_i$ are independent and have densities w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$ given by

$$Y\_l \sim f(\mathbf{y}\_l; \theta\_l, v\_l/\varphi\_l) = \exp\left\{ \frac{\mathbf{y}\_l \theta\_l - \kappa(\theta\_l)}{\varphi\_l/v\_l} + a(\mathbf{y}\_l; v\_l/\varphi\_l) \right\},$$

for $1 \le i \le n$, with canonical parameters $\theta\_i \in \mathring{\boldsymbol{\Theta}}$, exposures $v\_i > 0$ and dispersion parameters $\varphi\_i > 0$. As in (5.5) we assume that every policy $i$ is equipped with feature information $\boldsymbol{x}\_i \in \mathcal{X}$ such that for a given link function $g: \mathcal{M} \to \mathbb{R}$ we can model its mean as

$$\boldsymbol{x}\_i \mapsto g(\mu\_i) = g(\mu(\boldsymbol{x}\_i)) = g\left(\mathbb{E}\_{\theta(\boldsymbol{x}\_i)}\left[Y\_i\right]\right) = \eta\_i = \eta(\boldsymbol{x}\_i) = \langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle. \tag{5.56}$$

This provides us with the log-likelihood function for the observation $\boldsymbol{Y} = (Y\_1, \ldots, Y\_n)^\top$

$$\boldsymbol{\beta} \mapsto \ell\_Y(\boldsymbol{\beta}) = \sum\_{i=1}^n \frac{v\_i}{\varphi\_i} \left[ Y\_i h(\mu(\boldsymbol{x}\_i)) - \kappa \left( h(\mu(\boldsymbol{x}\_i)) \right) \right] + a(Y\_i; v\_i/\varphi\_i),$$

with canonical link $h = (\kappa')^{-1}$. The difference to (5.7) is that the dispersion parameter $\varphi\_i$ now depends on the insurance policy, which requires additional modeling. We choose a second strictly monotone and smooth link function $g\_\varphi: \mathbb{R}\_+ \to \mathbb{R}$, and we express the dispersion of policy $1 \le i \le n$ by

$$g\_{\varphi}(\varphi\_i) = g\_{\varphi}(\varphi(\boldsymbol{z}\_i)) = \langle \boldsymbol{\gamma}, \boldsymbol{z}\_i \rangle, \tag{5.57}$$

where $\boldsymbol{z}\_i$ is the feature of policy $i$, which may potentially differ from $\boldsymbol{x}\_i$. The rationale behind this different feature is that different information might be relevant for modeling the dispersion parameter, or feature information might be pre-processed differently compared to the response $Y\_i$. We now need to estimate two regression parameters $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ in this approach, on possibly differently pre-processed feature information $\boldsymbol{x}\_i$ and $\boldsymbol{z}\_i$ of policy $i$. In general, this is not easily done because the term $a(Y\_i; v\_i/\varphi\_i)$ of the log-likelihood of $Y\_i$ may have a complicated structure (or may not be available in closed form, as in Tweedie's CP model).

## *5.5.2 Saddlepoint Approximation*

We reformulate the EDF density using the unit deviance $\mathfrak{d}(Y, \mu)$ defined in (2.25); we drop the lower index $i$ for the moment. Set $\theta = h(\mu) \in \mathring{\boldsymbol{\Theta}}$ for the canonical link $h$; then

$$f(\mathbf{y};\theta,\mathbf{v}/\varphi) = \exp\left\{\frac{v}{\varphi}\left[\mathbf{y}h(\mu) - \kappa(h(\mu))\right] + a(\mathbf{y};v/\varphi)\right\}
$$

$$= \exp\left\{\frac{v}{\varphi}\left[\mathbf{y}h(\mathbf{y}) - \kappa(h(\mathbf{y}))\right] + a(\mathbf{y};v/\varphi)\right\}\exp\left\{-\frac{1}{2\varphi/v}\mathfrak{d}(\mathbf{y},\mu)\right\}
$$

$$\stackrel{\text{def.}}{=} a^\*(\mathbf{y};\omega)\,\exp\left\{-\frac{\omega}{2}\mathfrak{d}(\mathbf{y},\mu)\right\},\tag{5.58}$$

with *ω* = *v/ϕ* ∈ *W*. This corresponds to (2.27), and it brings the EDF density into a Gaussian-looking form. A general difficulty is that the term *a*∗*(y*; *ω)* may have a complicated structure or may not be given in closed form. Therefore, we consider its saddlepoint approximation; this is based on Section 3.5 of Jørgensen [203].

Suppose that we are in the absolutely continuous EDF case and that $\kappa$ is steep. In that case $Y \in \mathcal{M}$, a.s., and the variance function $y \mapsto V(y)$ is well-defined for all observations $Y = y$, a.s. Based on Daniels [87], Barndorff-Nielsen–Cox [24] proved the following statement, see Theorem 3.10 in Jørgensen [203]: assume there exists $\omega\_0 \in \mathcal{W}$ such that for all $\omega > \omega\_0$ the density (5.58) is bounded. Then, the following saddlepoint approximation is uniform on compact subsets of the support $\mathfrak{T}$ of $Y$

$$f(y;\theta, v/\varphi) = \left(\frac{2\pi\varphi}{v}V(y)\right)^{-1/2} \exp\left\{-\frac{1}{2\varphi/v}\,\mathfrak{d}(y,\mu)\right\} \left(1 + O\left(\varphi/v\right)\right),\tag{5.59}$$

as $\varphi/v \to 0$. What makes this saddlepoint approximation attractive is that we can get rid of the complicated function $a^*(y;\omega)$ by the neat approximation $\left(\frac{2\pi\varphi}{v}V(y)\right)^{-1/2}$ for sufficiently large volumes $v$; at the same time, this does not affect the unit deviance $\mathfrak{d}(y,\mu)$, preserving the estimation properties of $\mu$. The discrete counterpart is given in Theorem 3.11 of Jørgensen [203].
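The accuracy statement in (5.59) can be illustrated numerically. The following sketch (plain Python, an illustration not taken from the book) compares the exact gamma density in the reproductive parametrization with its saddlepoint approximation, using $V(y) = y^2$ and the gamma unit deviance $\mathfrak{d}(y,\mu) = 2(y/\mu - 1 - \log(y/\mu))$; the relative error shrinks as $\varphi/v \to 0$.

```python
import math

def gamma_density(y, mu, phi, v=1.0):
    """Exact gamma density in the reproductive EDF parametrization."""
    shape = v / phi
    rate = v / (phi * mu)
    return rate ** shape / math.gamma(shape) * y ** (shape - 1) * math.exp(-rate * y)

def saddlepoint_density(y, mu, phi, v=1.0):
    """Saddlepoint approximation (5.59) with V(y) = y^2 (gamma case)."""
    d = 2.0 * (y / mu - 1.0 - math.log(y / mu))  # gamma unit deviance
    return (2.0 * math.pi * phi * y ** 2 / v) ** (-0.5) * math.exp(-d / (2.0 * phi / v))

# the relative error decreases as phi/v -> 0
for phi in (0.5, 0.1, 0.01):
    y, mu = 1.3, 1.0
    rel_err = abs(saddlepoint_density(y, mu, phi) / gamma_density(y, mu, phi) - 1.0)
    print(phi, rel_err)
```

In the gamma case the relative error does not depend on $y$ and behaves like $\varphi/(12 v)$, which is exactly the Stirling error discussed further below.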

Using the saddlepoint approximation (5.59) we receive the approximate log-likelihood function

$$\ell_Y(\mu, \varphi) \approx \frac{1}{2}\left[-\varphi^{-1} v\, \mathfrak{d}(Y,\mu) - \log\left(\varphi\right)\right] - \frac{1}{2}\log\left(\frac{2\pi}{v}V(Y)\right).$$

This approximation has an attractive form for dispersion estimation because it gives an approximate EDF for the observation $\mathfrak{d} \stackrel{\text{def.}}{=} v\,\mathfrak{d}(Y,\mu)$, for given $\mu$. Namely, for the canonical parameter $\phi = -\varphi^{-1} < 0$ we have the approximation

$$\ell\_Y(\mu, \phi) \approx \frac{\mathfrak{d}\phi - (-\log(-\phi))}{2} - \frac{1}{2}\log\left(\frac{2\pi}{v}V(Y)\right). \tag{5.60}$$

The right-hand side has the structure of a gamma EDF for the observation $\mathfrak{d}$ with canonical parameter $\phi < 0$, cumulant function $\kappa_\varphi(\phi) = -\log(-\phi)$ and dispersion parameter 2. Thus, we have the structure of an approximate gamma model on the right-hand side of (5.60) with, for given $\mu$,

$$\mathbb{E}_{\phi}[\mathfrak{d}\,|\,\mu] \approx \kappa_\varphi'(\phi) = -\frac{1}{\phi} = \varphi,\tag{5.61}$$

$$\mathrm{Var}_{\phi}(\mathfrak{d}\,|\,\mu) \approx 2\kappa_\varphi''(\phi) = 2\frac{1}{\phi^2} = 2\varphi^2.\tag{5.62}$$

These statements say that for given *μ* and assuming that the saddlepoint approximation is sufficiently accurate, d is approximately gamma distributed with shape parameter 1/2 and canonical parameter *φ* (which relates to the dispersion *ϕ* in the mean parametrization). Thus, we can estimate *φ* and *ϕ*, respectively, with a (second) GLM from (5.60), for given mean parameter *μ*.
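A small simulation can illustrate (5.61)–(5.62). The sketch below (plain Python with hypothetical parameter values, not from the book) samples from a reproductive gamma model and checks that the empirical mean of the deviance observations $\mathfrak{d} = v\,\mathfrak{d}(Y,\mu)$ is close to $\varphi$ and their variance close to $2\varphi^2$.

```python
import math, random

random.seed(1)

mu, phi, v = 2.0, 0.1, 1.0             # mean, dispersion, volume (illustrative)
shape, scale = v / phi, phi * mu / v   # reproductive gamma parametrization

def unit_deviance(y, mu):
    """Gamma unit deviance d(y, mu)."""
    return 2.0 * (y / mu - 1.0 - math.log(y / mu))

devs = [v * unit_deviance(random.gammavariate(shape, scale), mu)
        for _ in range(50_000)]

mean_d = sum(devs) / len(devs)
var_d = sum((d - mean_d) ** 2 for d in devs) / len(devs)
print(mean_d, var_d)   # close to phi = 0.1 and 2 * phi^2 = 0.02
```

The empirical mean slightly overshoots $\varphi$; this is precisely the saddlepoint bias quantified by $\chi'(\phi)$ in the gamma remark below.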

#### *Remarks 5.28*

• The accuracy of the saddlepoint approximation is discussed in Section 3.2 of Smyth–Verbyla [343]. The saddlepoint approximation is exact in the Gaussian and the inverse Gaussian case. In the Gaussian case, we have log-likelihood

$$
\ell\_Y(\mu, \phi) = \frac{\mathfrak{d}\phi - (-\log \left( -\phi \right))}{2} - \frac{1}{2} \log \left( \frac{2\pi}{v} \right),
$$

with variance function $V(Y) = 1$. In the inverse Gaussian case, we have log-likelihood

$$\ell\_Y(\mu, \phi) = \frac{\mathfrak{d}\phi - (-\log \left( -\phi \right))}{2} - \frac{1}{2} \log \left( \frac{2\pi}{v} Y^3 \right),$$

with variance function $V(Y) = Y^3$. Thus, in the Gaussian case and in the inverse Gaussian case we have a gamma model for $\mathfrak{d}$ with mean $\varphi$ and shape parameter $1/2$, for given $\mu$; for a related result we also refer to Theorem 3 of Blæsild–Jensen [38]. For Tweedie's models with $p \ge 1$, one can show that the relative error of the saddlepoint approximation is a non-increasing function of the squared coefficient of variation $\tau = \frac{\varphi}{v}V(y)/y^2 = \frac{\varphi}{v}y^{p-2}$, leading to small approximation errors if $\varphi/v$ is sufficiently small; typically one requires $\tau < 1/3$, see Section 3.2 of Smyth–Verbyla [343].


• In the gamma EDF case the saddlepoint approximation is not exact. In this case, the log-likelihood can be written as

$$\ell_Y(\mu, \phi) = \frac{\phi\, \mathfrak{d}(Y,\mu) - \chi(\phi)}{2} - \log Y,\tag{5.63}$$

with $\chi(\phi) = 2\left(\log \Gamma(-\phi) + \phi\log(-\phi) - \phi\right)$. For given $\mu$, this is an EDF for $\mathfrak{d}(Y,\mu)$ with cumulant function $\chi$ on the effective domain $(-\infty, 0)$. This provides us with expected value and variance

$$\begin{aligned} \mathbb{E}\_{\phi}[\mathfrak{d}(Y,\mu)|\mu] &= \chi'(\phi) = 2\left(-\Psi(-\phi) + \log(-\phi)\right) \approx -\frac{1}{\phi}, \\\\ \mathrm{Var}\_{\phi}(\mathfrak{d}(Y,\mu)|\mu) &= 2\chi''(\phi) = 4\left(\Psi'(-\phi) - \frac{1}{-\phi}\right), \end{aligned}$$

with digamma function $\Psi$, and the approximation refers exactly to the saddlepoint approximation; for the variance statement we also refer to Fisher's information (2.30). To receive more accurate mean approximations one can consider higher order terms, e.g., the second order approximation is $\chi'(\phi) \approx -1/\phi + 1/(6\phi^2)$. In fact, from the saddlepoint approximation (5.60) and from the exact formula (5.63) we receive in the gamma case Stirling's formula

$$
\Gamma(\nu) \approx \sqrt{2\pi} \nu^{\nu - 1/2} e^{-\nu}.
$$
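Stirling's formula is easy to check numerically; the short sketch below (an illustration, not from the book) compares it with `math.gamma` for a few values of $\nu$, where the relative error behaves like $1/(12\nu)$.

```python
import math

def stirling(nu):
    """Stirling's approximation to the gamma function."""
    return math.sqrt(2.0 * math.pi) * nu ** (nu - 0.5) * math.exp(-nu)

for nu in (1.0, 5.0, 20.0):
    print(nu, math.gamma(nu), stirling(nu),
          abs(stirling(nu) / math.gamma(nu) - 1.0))  # relative error ~ 1/(12 nu)
```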

In the subsequent examples we will just use the saddlepoint approximation also in the gamma EDF case.

## *5.5.3 Residual Maximum Likelihood Estimation*

The saddlepoint approximation (5.60) proposes to alternate MLE of *β* for the mean model (5.56) and of *γ* for the dispersion model (5.57). Fisher's information matrix of the saddlepoint approximation (5.60) w.r.t. the canonical parameters *θ* and *φ* is given by

$$\mathcal{I}(\theta,\phi) = -\mathbb{E}_{\theta,\phi}\begin{pmatrix} \phi v \kappa''(\theta) & -v\left(Y - \kappa'(\theta)\right) \\ -v\left(Y - \kappa'(\theta)\right) & -\frac{1}{2}\frac{1}{\phi^2} \end{pmatrix} = \begin{pmatrix} \frac{v}{\varphi(\phi)}V(\mu(\theta)) & 0 \\ 0 & \frac{1}{2}V_\varphi(\varphi(\phi)) \end{pmatrix},$$

with variance function $V_\varphi(\varphi) = \varphi^2$, and emphasizing that we work in the canonical parametrization $(\theta, \phi)$. This is a positive definite diagonal matrix which suggests that the algorithm alternating the $\beta$ and $\gamma$ estimations will have a fast convergence. For a fixed estimate $\widehat{\gamma}$ we calculate the estimated dispersion parameters $\widehat{\varphi}_i = g_\varphi^{-1}\langle\widehat{\gamma}, z_i\rangle$ of the policies $1 \le i \le n$, see (5.57). These then allow us to calculate the diagonal working weight matrix

$$W(\beta) = \mathrm{diag}\left(\left(\frac{\partial g(\mu_i)}{\partial \mu_i}\right)^{-2}\frac{v_i}{\widehat{\varphi}_i}\frac{1}{V(\mu_i)}\right)_{1\le i\le n} \in \mathbb{R}^{n\times n},$$

which is used in Fisher's scoring method/the IRLS algorithm (5.12) to receive the MLE $\widehat{\beta}$, given the estimates $(\widehat{\varphi}_i)_i$. This MLE allows us to estimate the mean parameters $\widehat{\mu}_i = g^{-1}\langle\widehat{\beta}, x_i\rangle$, and to calculate the deviances

$$\mathfrak{d}_i = v_i\, \mathfrak{d}\left(Y_i, \widehat{\mu}_i\right) = 2 v_i\left(Y_i h\left(Y_i\right) - \kappa\left(h\left(Y_i\right)\right) - Y_i h\left(\widehat{\mu}_i\right) + \kappa\left(h\left(\widehat{\mu}_i\right)\right)\right) \ge 0.$$

Using (5.60) we know that these deviances can be approximated by gamma distributions $\Gamma(1/2, 1/(2\widehat{\varphi}_i))$. This is a single-parameter EDF with dispersion parameter 2 (as nuisance parameter) and mean parameter $\varphi_i$. This motivates the definition of the working weight matrix (based on the gamma EDF model)

$$W_\varphi(\gamma) = \mathrm{diag}\left(\left(\frac{\partial g_\varphi(\varphi_i)}{\partial \varphi_i}\right)^{-2}\frac{1}{2}\frac{1}{V_\varphi(\varphi_i)}\right)_{1\le i\le n} \in \mathbb{R}^{n\times n},$$

and the working residuals

$$\mathcal{R}_\varphi(\mathfrak{d}, \gamma) = \left(\frac{\partial g_\varphi(\varphi_i)}{\partial \varphi_i}\left(\mathfrak{d}_i - \varphi_i\right)\right)_{1\le i\le n}^\top \in \mathbb{R}^n.$$

Fisher's scoring method (5.12) iterates for $s \ge 0$ the following recursion to receive $\widehat{\gamma}$

$$
\widehat{\gamma}^{(s)} \mapsto \widehat{\gamma}^{(s+1)} = \left(\mathfrak{Z}^\top W_\varphi(\widehat{\gamma}^{(s)})\mathfrak{Z}\right)^{-1}\mathfrak{Z}^\top W_\varphi(\widehat{\gamma}^{(s)})\left(\mathfrak{Z}\widehat{\gamma}^{(s)} + \mathcal{R}_\varphi(\mathfrak{d}, \widehat{\gamma}^{(s)})\right),\tag{5.64}
$$

where $\mathfrak{Z} = (z_1, \ldots, z_n)^\top$ is the design matrix used to estimate the dispersion parameters.
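To make (5.64) concrete, consider its simplest instance: an intercept-only dispersion model with log-link $g_\varphi = \log$. Then the working weights reduce to the constant $1/2$, the working residuals to $\mathfrak{d}_i/\varphi - 1$, and the recursion collapses to a scalar fixed-point iteration whose limit is the mean of the deviance observations. A minimal sketch (illustrative deviance values, not from the book's data):

```python
import math

# deviance "observations" d_i (illustrative values, one per fitted mean)
devs = [0.8, 1.3, 2.1, 0.5, 1.9, 1.1]

# Fisher scoring (5.64) for an intercept-only dispersion model with log-link:
# varphi = exp(gamma); the working weights are constant 1/2, so the recursion
# reduces to gamma -> gamma + (mean(d) / exp(gamma) - 1)
gamma = 0.0
for _ in range(50):
    phi = math.exp(gamma)
    gamma = gamma + (sum(devs) / len(devs)) / phi - 1.0

print(math.exp(gamma))  # converges to mean(devs)
```

The fixed point satisfies $\widehat{\varphi} = \exp(\widehat{\gamma}) = \tfrac{1}{n}\sum_i \mathfrak{d}_i$, i.e., intercept-only residual MLE recovers the average deviance as dispersion estimate.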

## *5.5.4 Lab: Double GLM Algorithm for Gamma Claim Sizes*

We revisit the Swedish motorcycle claim size data studied in Sect. 5.3.7. We expand the gamma claim size GLM to a double GLM, also modeling the systematic effects in the dispersion parameter. In a first step, we need to change the parametrization of the gamma model of Sect. 5.3.7. In that section we have modeled the average claim size $S_i/n_i \sim \Gamma(n_i\alpha_i, n_i c_i)$, but for applying the saddlepoint approximation we should use the reproductive form (5.44) of the gamma model. We therefore set

$$Y_i = S_i/(n_i\alpha_i) \sim \Gamma(n_i\alpha_i, n_i\alpha_i c_i).\tag{5.65}$$

The reason for the different parametrization in Sect. 5.3.7 has been that (5.65) is not directly useful if $\alpha_i$ is unknown, because in that case the observations $Y_i$ cannot be calculated. In this section we estimate $\varphi_i = 1/\alpha_i$, which allows us to model (5.65); a different treatment within Tweedie's family is presented in Sect. 11.1.3. The only difficulty is to initialize the double GLM algorithm. We proceed as follows, iterating for $t \ge 1$:

- estimate the mean $\mu_i$ of $Y_i$ using the mean GLM (5.56), based on the observations $Y_i^{(t)}$ and the dispersion estimates $\widehat{\varphi}_i^{(t-1)}$; this provides us with $\widehat{\mu}_i^{(t)}$;
- based on the deviances $\mathfrak{d}_i^{(t)} = v_i\, \mathfrak{d}(Y_i^{(t)}, \widehat{\mu}_i^{(t)})$, calculate the updated dispersion estimates $\widehat{\varphi}_i^{(t)}$ using the dispersion GLM (5.57) and the residual MLE iteration (5.64) with the saddlepoint approximation; then set for the updated observations $Y_i^{(t+1)} = S_i\widehat{\varphi}_i^{(t)}/n_i$.
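The alternating structure of this algorithm can be sketched in code. The toy version below (simulated gamma data, intercept-only mean and dispersion models with log-links, so both Fisher scoring steps have closed forms, and ignoring the rescaling of the observations that the unknown shape parameter necessitates in the lab) illustrates the loop structure only; it is not the book's Swedish motorcycle analysis.

```python
import math, random

random.seed(7)

# simulate reproductive gamma data: Y_i with mean mu0 and dispersion phi0
mu0, phi0, n = 3.0, 0.2, 20_000
ys = [random.gammavariate(1.0 / phi0, phi0 * mu0) for _ in range(n)]

def unit_deviance(y, mu):
    """Gamma unit deviance d(y, mu)."""
    return 2.0 * (y / mu - 1.0 - math.log(y / mu))

# alternate the two (intercept-only, log-link) GLMs; with constant weights
# each Fisher scoring fit has a closed form: mu -> mean(Y), phi -> mean(d)
mu_hat, phi_hat = 1.0, 1.0
for _ in range(10):
    # mean GLM step (5.56): intercept-only gamma GLM with log-link
    mu_hat = sum(ys) / n
    # dispersion GLM step (5.57) via the saddlepoint approximation (5.60)
    devs = [unit_deviance(y, mu_hat) for y in ys]
    phi_hat = sum(devs) / n

print(mu_hat, phi_hat)  # close to mu0 = 3.0 and phi0 = 0.2
```

The dispersion estimate slightly overshoots $\varphi_0$ because the mean deviance equals $\chi'(\phi)$ rather than $\varphi$ exactly, cf. Remarks 5.28.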


**Table 5.15** Number of parameters, AICs, Pearson's dispersion estimate, in-sample losses, tenfold cross-validation losses and the in-sample average claim amounts of the null model (gamma intercept model) and the (double) gamma GLM

In an initial double GLM analysis we use the feature information $z_i = x_i$ for the dispersion modeling (5.57). We choose the log-link for both GLMs, which leads to concave maximization problems, see Example 5.5. The above double GLM algorithm converges in 4 iterations, and analyzing the resulting model we observe that we should drop the variable RiskClass from the feature $z_i$. We then run the same double GLM algorithm with the feature information $x_i$ and the new $z_i$ again; the results are presented in Table 5.15.

The considered double GLM has parameter dimensions $\beta \in \mathbb{R}^7$ and $\gamma \in \mathbb{R}^6$. To have comparability with the AIC of Sect. 5.3.7, we evaluate the AIC of the double GLM in the observations $S_i/n_i$ (and not in $Y_i$; i.e., similarly to the gamma GLM). We observe that it has an improved AIC value compared to model Gamma GLM2. Thus, indeed, dispersion modeling seems necessary in this example (under the GLM2 regression structure). We do not calculate in-sample and cross-validation losses in the double GLM because in the other two models of Table 5.15 we have set $\varphi = 1$ in these statistics. However, the in-sample loss of model Gamma GLM2 with $\varphi = 1$ corresponds to the (homogeneous) deviance dispersion estimate (up to the scaling $n/(n-(q+1))$), and this in-sample loss of 1.719 can directly be compared to the average estimated dispersion $m^{-1}\sum_{i=1}^m \widehat{\varphi}_i = 1.721$ (in round brackets in Table 5.15). On the downside, the double GLM has a bigger bias which needs an adjustment.

In Fig. 5.13 (lhs) we give the normal plots of model Gamma GLM2 and of the double gamma GLM model. This plot is received by transforming the observations to normal quantiles using the corresponding estimated gamma models. We see quite some similarity between the two estimated gamma models. Both models seem to have similar deficiencies, i.e., dispersion modeling improves the explanation of the observations; however, either the regression function or the gamma distributional assumption does not fully fit the data, especially for small claims. Finally, in Fig. 5.13 (rhs) we plot the estimated dispersion parameters $\widehat{\varphi}_i$ against the logged estimated means $\log(\widehat{\mu}_i)$ (linear predictors). We observe that the estimated dispersion has a (weak) U-shape as a function of the expected claim sizes, which indicates that the tails cannot fully be captured by our model. This closes this example.

*Remark 5.29* For the dispersion estimation $\widehat{\varphi}_i$ we use as observations the deviances $\mathfrak{d}_i = v_i\,\mathfrak{d}(Y_i, \widehat{\mu}_i)$, $1 \le i \le n$. On a finite sample, these deviances are typically biased due to the use of the estimated means $\widehat{\mu}_i$. Smyth–Verbyla [343] propose the

**Fig. 5.13** (lhs) Normal plot of the fitted models Gamma GLM2 and double GLM, (rhs) estimated dispersion parameters $\widehat{\varphi}_i$ against the logged estimated means $\log(\widehat{\mu}_i)$ (the orange line gives the in-sample loss in model Gamma GLM2)

following bias correction. Consider the estimated hat matrix defined by

$$H = W(\widehat{\beta}, \widehat{\gamma})^{1/2}\,\mathfrak{X}\left(\mathfrak{X}^\top W(\widehat{\beta}, \widehat{\gamma})\,\mathfrak{X}\right)^{-1}\mathfrak{X}^\top W(\widehat{\beta}, \widehat{\gamma})^{1/2},$$

with the diagonal working weight matrix $W(\widehat{\beta}, \widehat{\gamma})$ depending on the estimated regression parameters $\widehat{\beta}$ and $\widehat{\gamma}$ through $\widehat{\mu}$ and $\widehat{\varphi}$. Denote the diagonal entries of the hat matrix by $(h_{i,i})_{1\le i\le n}$. A bias corrected version of the deviances is received by considering the observations $(1-h_{i,i})^{-1}\mathfrak{d}_i = (1-h_{i,i})^{-1}v_i\,\mathfrak{d}(Y_i, \widehat{\mu}_i)$, $1 \le i \le n$. We will come back to the hat matrix $H$ in Sect. 5.6.1, below.

## *5.5.5 Tweedie's Compound Poisson GLM*

A popular situation for applying the double GLM framework is Tweedie's CP model introduced in Sect. 2.2.3; in particular, we refer to Proposition 2.17 for the corresponding parametrization. Having both claim frequency and claim sizes involved, such a model can hardly be calibrated with one single regression function and a constant dispersion parameter. An obvious choice is a double GLM; this is the proposal presented in Smyth–Jørgensen [342]. In most cases one chooses the log-link for both link functions $g$ and $g_\varphi$ because positivity needs to be guaranteed. This implies for the two working weight matrices of the double GLM

$$W(\beta) = \mathrm{diag}\left(\mu_i^2\,\frac{v_i}{\varphi_i}\frac{1}{V(\mu_i)}\right)_{1\le i\le n} = \mathrm{diag}\left(\mu_i^{2-p}\,\frac{v_i}{\varphi_i}\right)_{1\le i\le n},$$

$$W_\varphi(\gamma) = \mathrm{diag}\left(\varphi_i^2\,\frac{1}{2}\frac{1}{V_\varphi(\varphi_i)}\right)_{1\le i\le n} = \mathrm{diag}\left(1/2, \ldots, 1/2\right).$$

The deviances in Tweedie's CP model are given by, see (4.18),

$$\mathfrak{d}_i = v_i\,\mathfrak{d}\left(Y_i, \widehat{\mu}_i\right) = 2 v_i\left(Y_i\,\frac{Y_i^{1-p} - \widehat{\mu}_i^{1-p}}{1-p} - \frac{Y_i^{2-p} - \widehat{\mu}_i^{2-p}}{2-p}\right) \ge 0,$$

and these deviances could still be de-biased, see Remark 5.29. The working responses for the two GLMs are

$$\mathcal{R} = \left(Y_i/\mu_i - 1\right)_{1\le i\le n}^\top \qquad\text{and}\qquad \mathcal{R}_\varphi = \left(\mathfrak{d}_i/\varphi_i - 1\right)_{1\le i\le n}^\top.$$

The drawback of this approach is that it only considers the (scaled) total claim amounts $Y_i = S_i\varphi_i/v_i$ as observations, see Proposition 2.17. These total claim amounts consist of the numbers of claims $N_i$ and the i.i.d. individual claim sizes $Z_{i,j} \sim \Gamma(\alpha, c_i)$, supposing $N_i \ge 1$. Having observations of both the claim amounts $S_i$ and the claim counts $N_i$ allows one to build a Poisson GLM for the claim counts and a gamma GLM for the claim sizes, which can be estimated separately. This has also been the reason for Smyth–Jørgensen [342] to enhance Tweedie's model estimation for known claim counts in their Section 4. Moreover, Theorem 4 of Delong et al. [94] proves that the two GLM approaches can be identified under log-link choices.
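The Tweedie unit deviance displayed above can be coded directly. The sketch below (an illustration with hypothetical inputs) uses the algebraically equivalent three-term form, which also handles the point mass at $y = 0$ of Tweedie's CP model, where $\mathfrak{d}(0,\mu) = 2\mu^{2-p}/(2-p)$.

```python
def tweedie_unit_deviance(y, mu, p):
    """Tweedie unit deviance for power variance parameter 1 < p < 2.

    Equivalent to the displayed formula, rearranged so that the point
    mass at y = 0 needs no special-casing (y**(1-p) never appears alone).
    """
    assert 1.0 < p < 2.0 and y >= 0.0 and mu > 0.0
    return 2.0 * (y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
                  - y * mu ** (1.0 - p) / (1.0 - p)
                  + mu ** (2.0 - p) / (2.0 - p))

# sanity checks: zero at y = mu, positive elsewhere, finite at y = 0
print(tweedie_unit_deviance(2.0, 2.0, 1.5))
print(tweedie_unit_deviance(3.0, 2.0, 1.5))
print(tweedie_unit_deviance(0.0, 2.0, 1.5))
```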

## **5.6 Diagnostic Tools**

In our examples we have studied several figures like AIC, cross-validation losses, etc., for model and parameter selection. Moreover, we have plotted the results, for instance, using the Tukey–Anscombe plot or the QQ plot. Of course, there are numerous other plots and tools that can help us to analyze the results and to improve the resulting models. We present some of these in this section.

## *5.6.1 The Hat Matrix*

The MLE $\widehat{\beta}^{\text{MLE}}$ satisfies at convergence of the IRLS algorithm, see (5.12),

$$
\widehat{\beta}^{\text{MLE}} = \left(\mathfrak{X}^\top W(\widehat{\beta}^{\text{MLE}})\mathfrak{X}\right)^{-1}\mathfrak{X}^\top W(\widehat{\beta}^{\text{MLE}})\left(\mathfrak{X}\widehat{\beta}^{\text{MLE}} + \mathcal{R}(Y, \widehat{\beta}^{\text{MLE}})\right),
$$

with working residuals for $\beta \in \mathbb{R}^{q+1}$

$$\mathcal{R}(Y, \beta) = \left(\left.\frac{\partial g(\mu_i)}{\partial \mu_i}\right|_{\mu_i = \mu_i(\beta)}\left(Y_i - \mu_i(\beta)\right)\right)_{1\le i\le n}^\top \in \mathbb{R}^n.$$

Following Section 4.2.2 of Fahrmeir–Tutz [123], this allows us to define the so-called *hat matrix*, see also Remark 5.29,

$$H = H(\widehat{\beta}^{\text{MLE}}) = W(\widehat{\beta}^{\text{MLE}})^{1/2}\mathfrak{X}\left(\mathfrak{X}^\top W(\widehat{\beta}^{\text{MLE}})\mathfrak{X}\right)^{-1}\mathfrak{X}^\top W(\widehat{\beta}^{\text{MLE}})^{1/2} \in \mathbb{R}^{n\times n},\tag{5.66}$$

recall that the working weight matrix $W(\beta)$ is diagonal. The hat matrix $H$ is symmetric and idempotent, i.e., $H^2 = H$, with $\mathrm{trace}(H) = \mathrm{rank}(H) = q+1$. Therefore, $H$ acts as a projection, mapping the observations $\widetilde{Y}$ to the fitted values

$$\begin{aligned} \widetilde{Y} \stackrel{\text{def.}}{=} W(\widehat{\beta}^{\text{MLE}})^{1/2}\left(\mathfrak{X}\widehat{\beta}^{\text{MLE}} + \mathcal{R}(Y, \widehat{\beta}^{\text{MLE}})\right) \mapsto H\widetilde{Y} &= W(\widehat{\beta}^{\text{MLE}})^{1/2}\mathfrak{X}\widehat{\beta}^{\text{MLE}} \\ &= W(\widehat{\beta}^{\text{MLE}})^{1/2}\widehat{\eta}, \end{aligned}$$

the latter being the fitted linear predictors. The diagonal elements $h_{i,i}$ of this hat matrix $H$ satisfy $0 \le h_{i,i} \le 1$, and values close to 1 correspond to extreme data points $i$; in particular, for $h_{i,i} = 1$ only observation $\widetilde{Y}_i$ influences $\widehat{\eta}_i$, whereas for $h_{i,i} = 0$ observation $\widetilde{Y}_i$ has no influence on $\widehat{\eta}_i$.
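The stated properties of $H$ are easy to verify numerically. The sketch below (Python with NumPy; the design matrix and working weights are random placeholders, not from the book's data) builds the hat matrix (5.66) and checks symmetry, idempotence, $\mathrm{trace}(H) = q+1$, and that the leverages lie in $[0,1]$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, q = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])  # design matrix
W = np.diag(rng.uniform(0.5, 2.0, size=n))                  # diagonal working weights

Wh = np.sqrt(W)                                             # W^(1/2)
H = Wh @ X @ np.linalg.inv(X.T @ W @ X) @ X.T @ Wh          # hat matrix (5.66)
h = np.diag(H)                                              # leverages h_ii

print(np.allclose(H, H.T))        # symmetric
print(np.allclose(H @ H, H))      # idempotent: H^2 = H
print(np.trace(H))                # = q + 1 = 4
print(h.min(), h.max())           # leverages in [0, 1]
```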

Figure 5.14 gives the resulting hat matrices of the double gamma GLM of Sect. 5.5.4. On the left-hand side we show the diagonal entries $h_{i,i}$ of the claim

**Fig. 5.14** Diagonal entries $h_{i,i}$ of the two hat matrices of the example in Sect. 5.5.4: (lhs) for the means $\widehat{\mu}_i$ and responses $Y_i$, and (rhs) for the dispersions $\widehat{\varphi}_i$ and responses $\mathfrak{d}_i$

amount responses $Y_i$ (for the estimation of $\mu_i$), and on the right-hand side the corresponding plot for the deviance responses $\mathfrak{d}_i$ (for the estimation of $\varphi_i$). These diagonal elements $h_{i,i}$ are ordered on the $x$-axis w.r.t. the linear predictors $\widehat{\eta}_i$. From this figure we conclude that the diagonal entries of the hat matrices are bigger for very small responses in our example, and the dispersion plot has a couple more special observations that may require further analysis.

## *5.6.2 Case Deletion and Generalized Cross-Validation*

As a continuation of the previous subsection we can analyze the influence of an individual observation $Y_i$ on the estimation of the regression parameter $\beta$. This influence is naturally measured by fitting the regression parameter based on the full data $\mathcal{D}$ and based only on the observations $\mathcal{D}_{(-i)} = \mathcal{D}\setminus\{Y_i\}$; we also refer to leave-one-out cross-validation in Sect. 4.2.2. The influence of observation $Y_i$ is then obtained by comparing $\widehat{\beta}^{\text{MLE}}$ and $\widehat{\beta}^{\text{MLE}}_{(-i)}$. Since fitting $n$ different models by individually leaving out each observation $Y_i$ is too costly, one only explores a one-step Fisher's scoring update starting from $\widehat{\beta}^{\text{MLE}}$ that provides an approximation to $\widehat{\beta}^{\text{MLE}}_{(-i)}$, that is,

$$\begin{aligned} \widehat{\beta}^{(1)}_{(-i)} &= \left(\mathfrak{X}_{(-i)}^\top W_{(-i)}(\widehat{\beta}^{\text{MLE}})\mathfrak{X}_{(-i)}\right)^{-1}\mathfrak{X}_{(-i)}^\top W_{(-i)}(\widehat{\beta}^{\text{MLE}})\left(\mathfrak{X}\widehat{\beta}^{\text{MLE}} + \mathcal{R}(Y, \widehat{\beta}^{\text{MLE}})\right)_{(-i)} \\ &= \left(\mathfrak{X}_{(-i)}^\top W_{(-i)}(\widehat{\beta}^{\text{MLE}})\mathfrak{X}_{(-i)}\right)^{-1}\mathfrak{X}_{(-i)}^\top W_{(-i)}(\widehat{\beta}^{\text{MLE}})^{1/2}\widetilde{Y}_{(-i)}, \end{aligned}$$

where all lower indices $(-i)$ indicate that we drop the corresponding row or/and column from the matrices and vectors, and where $\widetilde{Y}$ has been defined in the previous subsection. This allows us to compare $\widehat{\beta}^{\text{MLE}}$ and $\widehat{\beta}^{(1)}_{(-i)}$ to analyze the influence of observation $Y_i$.

To reformulate this approximation, we come back to the hat matrix $H = H(\widehat{\beta}^{\text{MLE}}) = (h_{i,j})_{1\le i,j\le n}$ defined in (5.66). It fulfills

$$W(\widehat{\beta}^{\text{MLE}})^{1/2}\mathfrak{X}\widehat{\beta}^{\text{MLE}} = H\widetilde{Y} = \left(\sum_{j=1}^n h_{1,j}\widetilde{Y}_j, \ldots, \sum_{j=1}^n h_{n,j}\widetilde{Y}_j\right)^\top \in \mathbb{R}^n.$$

Thus, for predicting $Y_i$ we can consider the linear predictor (for the chosen link $g$)

$$\widehat{\eta}_i = g(\widehat{\mu}_i) = \langle\widehat{\beta}^{\text{MLE}}, x_i\rangle = \left(\mathfrak{X}\widehat{\beta}^{\text{MLE}}\right)_i = W_{i,i}(\widehat{\beta}^{\text{MLE}})^{-1/2}\sum_{j=1}^n h_{i,j}\widetilde{Y}_j.$$

A computation of the linear predictor of $Y_i$ using the leave-one-out approximation $\widehat{\beta}^{(1)}_{(-i)}$ gives

$$
\widehat{\eta}_i^{(-i,1)} = \langle\widehat{\beta}^{(1)}_{(-i)}, x_i\rangle = \frac{1}{1-h_{i,i}}\widehat{\eta}_i - W_{i,i}(\widehat{\beta}^{\text{MLE}})^{-1/2}\frac{h_{i,i}}{1-h_{i,i}}\widetilde{Y}_i.
$$

This allows one to efficiently calculate a leave-one-out prediction using the hat matrix $H$. This also motivates the study of the *generalized cross-validation* (GCV) loss, which is an approximation to leave-one-out cross-validation, see Sect. 4.2.2,

$$
\widehat{\mathfrak{D}}^{\text{GCV}} = \frac{1}{n} \sum\_{i=1}^{n} \frac{v\_i}{\varphi} \mathfrak{d}\left(Y\_i, \mathfrak{g}^{-1}(\widehat{\eta}\_i^{(-i,1)})\right) \tag{5.67}
$$

$$
= \frac{2}{n} \sum\_{i=1}^{n} \frac{v\_i}{\varphi} \left[ Y\_i h\left(Y\_i\right) - \kappa\left(h\left(Y\_i\right)\right) - Y\_i h\left(\mathfrak{g}^{-1}(\widehat{\eta}\_i^{(-i,1)})\right) + \kappa\left(h\left(\mathfrak{g}^{-1}(\widehat{\eta}\_i^{(-i,1)})\right)\right) \right].
$$

*Example 5.30 (Generalized Cross-Validation Loss in the Gaussian Case)* We study the generalized cross-validation loss $\widehat{\mathfrak{D}}^{\text{GCV}}$ in the homoskedastic Gaussian case $v_i/\varphi \equiv 1/\sigma^2$ with cumulant function $\kappa(\theta) = \theta^2/2$ and canonical link $g(\mu) = h(\mu) = \mu$. The generalized cross-validation loss in the Gaussian case is given by

$$
\widehat{\mathfrak{D}}^{\rm GCV} = \frac{1}{n} \sum\_{i=1}^{n} \frac{1}{\sigma^2} \left( Y\_i - \widehat{\eta}\_i^{(-i, 1)} \right)^2,
$$

with (linear) leave-one-out predictor

$$\widehat{\eta}_i^{(-i,1)} = \langle\widehat{\beta}^{(1)}_{(-i)}, x_i\rangle = \sum_{j=1, j\ne i}^n \frac{h_{i,j}}{1-h_{i,i}}\, Y_j = \frac{1}{1-h_{i,i}}\widehat{\eta}_i - \frac{h_{i,i}}{1-h_{i,i}}\, Y_i.$$

This gives us the generalized cross-validation loss in the Gaussian case

$$
\widehat{\mathfrak{D}}^{\text{GCV}} = \frac{1}{n} \sum\_{i=1}^{n} \frac{1}{\sigma^2} \left( \frac{Y\_i - \widehat{\eta}\_i}{1 - h\_{i,i}} \right)^2,
$$

with the $\beta$-independent hat matrix

$$H = \mathfrak{X}\left(\mathfrak{X}^\top\mathfrak{X}\right)^{-1}\mathfrak{X}^\top.$$

The generalized cross-validation loss is used, for instance, in generalized additive model (GAM) fitting, where an efficient and fast cross-validation method is required to select regularization parameters. Generalized cross-validation has been introduced by Craven–Wahba [84], but these authors replaced $h_{i,i}$ by $\sum_{j=1}^n h_{j,j}/n$. It holds that $\sum_{j=1}^n h_{j,j} = \mathrm{trace}(H) = q+1$; thus, using this approximation we receive

$$\widehat{\mathfrak{D}}^{\text{GCV}} \approx \frac{1}{n}\sum_{i=1}^n \frac{1}{\sigma^2}\left(\frac{Y_i - \widehat{\eta}_i}{1 - \sum_{j=1}^n h_{j,j}/n}\right)^2 = \frac{n}{(n-(q+1))^2}\sum_{i=1}^n \frac{(Y_i - \widehat{\eta}_i)^2}{\sigma^2} = \frac{n}{n-(q+1)}\frac{\widehat{\varphi}^{\text{P}}}{\sigma^2},$$

with $\widehat{\varphi}^{\text{P}}$ being Pearson's dispersion estimate in the Gaussian model, see (5.30).
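In this Gaussian linear case the one-step leave-one-out approximation is in fact exact, so the leave-one-out residuals $(Y_i - \widehat{\eta}_i)/(1-h_{i,i})$ can be obtained from a single full fit. The sketch below (NumPy, simulated data as an illustration) verifies this identity against brute-force refits.

```python
import numpy as np

rng = np.random.default_rng(42)

n, q = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
Y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # beta-independent hat matrix
eta = H @ Y                            # fitted values
h = np.diag(H)

# shortcut: leave-one-out residuals from one full fit
loo_short = (Y - eta) / (1.0 - h)

# brute force: refit n times, each time dropping one observation
loo_brute = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[mask], Y[mask], rcond=None)[0]
    loo_brute[i] = Y[i] - X[i] @ beta_i

print(np.allclose(loo_short, loo_brute))
```

The agreement is exact (up to floating point), which is why leave-one-out cross-validation is cheap in linear models, while for general GLMs only the one-step approximation is available.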

We give a numerical example based on the gamma GLM for the claim sizes studied in Sect. 5.3.7.

*Example 5.31 (Leave-One-Out Cross-Validation)* The aim of this example is to compare the generalized cross-validation loss $\widehat{\mathfrak{D}}^{\text{GCV}}$ to the leave-one-out cross-validation loss $\widehat{\mathfrak{D}}^{\text{loo}}$, see (4.34), the former being an approximation to the latter. We do this for the gamma claim size model studied in Sect. 5.3.7. In this example it is feasible to exactly calculate the leave-one-out cross-validation loss because we have only 656 claims.

The results are presented in Table 5.16. Firstly, the different cross-validation losses confirm that the model slightly (in-sample) over-fits to the data, which is not a surprise when estimating 7 regression parameters based on 656 observations. Secondly, the cross-validation losses provide similar numbers, with leave-one-out being slightly bigger than tenfold cross-validation here. Thirdly, the generalized cross-validation loss $\widehat{\mathfrak{D}}^{\text{GCV}}$ manages to approximate the leave-one-out cross-validation loss $\widehat{\mathfrak{D}}^{\text{loo}}$ very well in this example.

Table 5.17 gives the corresponding results for model Poisson GLM1 of Sect. 5.2.4. Firstly, in this example with 610'206 observations it is not feasible to calculate the leave-one-out cross-validation loss (for computational reasons). Therefore, we rely on the generalized cross-validation loss as an approximation. From the results of Table 5.17 it seems that this approximation (rather) underestimates the loss (compared to tenfold cross-validation). Indeed, this is an observation that we have also made in other examples.




## **5.7 Generalized Linear Models with Categorical Responses**

The reader will have noticed that the discussion of GLMs in this chapter has been focusing on the single-parameter linear EDF case (5.1). In many actuarial applications we also want to study examples of the vector-valued parameter EF (2.2). We briefly discuss the categorical case since this case is frequently used.

## *5.7.1 Logistic Categorical Generalized Linear Model*

We recall the EF representation of the categorical distribution studied in Sect. 2.1.4. We choose as $\nu$ the counting measure on the finite set $\mathcal{Y} = \{1,\ldots,k+1\}$. A random variable $Y$ taking values in $\mathcal{Y}$ is called categorical, and the levels $y \in \mathcal{Y}$ can either be ordinal or nominal. This motivates dummy coding of the categorical random variable $Y$ providing

$$T(Y) = (\mathbb{1}\_{\{Y=1\}}, \dots, \mathbb{1}\_{\{Y=k\}})^\top \in \{0, 1\}^k,\tag{5.68}$$

thus, $k+1$ has been chosen as the reference level. For the canonical parameter $\theta = (\theta_1,\ldots,\theta_k)^\top \in \boldsymbol{\Theta} = \mathbb{R}^k$ we have cumulant function and mean functional, respectively,

$$\kappa(\theta) = \log \left( 1 + \sum\_{j=1}^{k} e^{\theta\_j} \right), \qquad \mathbf{p} = \mathbb{E}\_{\theta} [T(Y)] = \nabla\_{\theta} \kappa(\theta) = \frac{e^{\theta}}{1 + \sum\_{j=1}^{k} e^{\theta\_j}}.$$

With these choices we receive the EF representation of the categorical distribution (set *θk*+<sup>1</sup> = 0)

$$dF(\mathbf{y}; \boldsymbol{\theta}) = \exp\left\{\theta^\top T(\mathbf{y}) - \log\left(1 + \sum\_{j=1}^k e^{\theta\_j}\right)\right\} d\boldsymbol{\nu}(\mathbf{y}) = \prod\_{l=1}^{k+1} \left(\frac{e^{\theta\_l}}{\sum\_{j=1}^{k+1} e^{\theta\_j}}\right)^{\mathbf{1}\_{\{\mathbf{y} = l\}}} d\boldsymbol{\nu}(\mathbf{y}).$$

The covariance matrix of *T (Y )* is given by

$$\Sigma(\theta) = \mathrm{Var}_\theta(T(Y)) = \nabla^2_\theta\kappa(\theta) = \mathrm{diag}\left(p\right) - p\,p^\top \in \mathbb{R}^{k\times k}.$$

Assume that we have feature information $x \in \mathcal{X} \subset \{1\}\times\mathbb{R}^q$ for the response variable $Y$. This allows us to lift this categorical model to a GLM. The *logistic GLM* assumes for $p = (p_1,\ldots,p_k)^\top \in (0,1)^k$ the regression function, $1 \le l \le k$,

$$x \mapsto p_l = p_l(x) = \mathbb{P}_\beta[Y = l] = \frac{\exp\langle\beta_l, x\rangle}{1 + \sum_{j=1}^k \exp\langle\beta_j, x\rangle},\tag{5.69}$$

for the regression parameter $\beta = (\beta_1^\top, \ldots, \beta_k^\top)^\top \in \mathbb{R}^{k(q+1)}$. Equivalently, we can rewrite these regression probabilities relative to the reference level, that is, we consider the linear predictors for $1 \le l \le k$

$$\eta_l(x) = \log\left(\frac{\mathbb{P}_\beta[Y = l]}{\mathbb{P}_\beta[Y = k+1]}\right) = \langle\beta_l, x\rangle.\tag{5.70}$$

Note that this naturally gives us the canonical link $h$, which we have already derived in Sect. 2.1.4. Define for the feature $x \in \mathcal{X} \subset \{1\}\times\mathbb{R}^q$ the matrix

$$X = \begin{pmatrix} \mathbf{x}^\top & 0 & 0 & \cdots & 0 \\ 0 & \mathbf{x}^\top & 0 & \cdots & 0 \\ 0 & 0 & \mathbf{x}^\top & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \mathbf{x}^\top \end{pmatrix} \in \mathbb{R}^{k \times k(q+1)}.\tag{5.71}$$

This gives linear predictor and canonical parameter, respectively, under the canonical link *h*

$$\boldsymbol{\theta} = h(\boldsymbol{p}(\boldsymbol{x})) = \boldsymbol{\eta}(\boldsymbol{x}) = X\boldsymbol{\beta} = \left( \langle \boldsymbol{\beta}\_1, \boldsymbol{x} \rangle, \dots, \langle \boldsymbol{\beta}\_k, \boldsymbol{x} \rangle \right)^\top \in \boldsymbol{\Theta} = \mathbb{R}^k. \tag{5.72}$$

## *5.7.2 Maximum Likelihood Estimation in Categorical Models*

Assume we have $n$ independent observations $Y\_i$ following the logistic categorical GLM (5.69) with features $\boldsymbol{x}\_i \in \mathbb{R}^{q+1}$ and $X\_i \in \mathbb{R}^{k \times k(q+1)}$, respectively, for $1 \le i \le n$. Using (5.72), the joint log-likelihood function is given by

$$\boldsymbol{\beta} \mapsto \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \sum\_{i=1}^n (X\_i \boldsymbol{\beta})^\top T(Y\_i) - \kappa(X\_i \boldsymbol{\beta}).$$

This provides us with score equations

$$s(\boldsymbol{\beta}, \boldsymbol{Y}) = \nabla\_{\boldsymbol{\beta}} \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \sum\_{i=1}^{n} X\_i^{\top} \left[ T(Y\_i) - \nabla\_{\boldsymbol{\theta}} \kappa(X\_i \boldsymbol{\beta}) \right] = \sum\_{i=1}^{n} X\_i^{\top} \left[ T(Y\_i) - \boldsymbol{p}(\boldsymbol{x}\_i) \right] = 0,$$

with logistic regression function (5.69) for $\boldsymbol{p}(\boldsymbol{x})$. For the score equations under the canonical link we also refer to the second case in Proposition 5.1. Next, we calculate Fisher's information matrix, see also (3.16),

$$\mathcal{I}\_n(\boldsymbol{\beta}) = -\mathbb{E}\_{\boldsymbol{\beta}}\left[\nabla\_{\boldsymbol{\beta}}^2 \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta})\right] = \sum\_{i=1}^n X\_i^\top \Sigma\_i(\boldsymbol{\beta}) X\_i,$$

with covariance matrix of *T (Yi)*

$$\Sigma\_i(\boldsymbol{\beta}) = \nabla^2\_{\boldsymbol{\theta}} \kappa(X\_i \boldsymbol{\beta}) = \text{diag}\left(\boldsymbol{p}(\boldsymbol{x}\_i)\right) - \boldsymbol{p}(\boldsymbol{x}\_i)\,\boldsymbol{p}(\boldsymbol{x}\_i)^\top.$$

We rewrite the score in a similar way as in Sect. 5.1.4. This requires, for a general link $g(\boldsymbol{p}) = \boldsymbol{\eta}$ with inverse link $\boldsymbol{p} = g^{-1}(\boldsymbol{\eta})$, the following block diagonal matrix

$$W(\boldsymbol{\beta}) = \text{diag}\left(\left(\left.\nabla\_{\boldsymbol{\eta}}\, g^{-1}(\boldsymbol{\eta})\right|\_{\boldsymbol{\eta}=X\_{i}\boldsymbol{\beta}}\right)\Sigma\_{i}(\boldsymbol{\beta})^{-1}\left(\left.\nabla\_{\boldsymbol{\eta}}\, g^{-1}(\boldsymbol{\eta})\right|\_{\boldsymbol{\eta}=X\_{i}\boldsymbol{\beta}}\right)^{\top}\right)\_{1\leq i\leq n}$$

$$=\text{diag}\left(\left(\left.\nabla\_{\boldsymbol{p}}\, g(\boldsymbol{p})\right|\_{\boldsymbol{p}=g^{-1}(X\_{i}\boldsymbol{\beta})}\right)^{\top}\Sigma\_{i}(\boldsymbol{\beta})\left(\left.\nabla\_{\boldsymbol{p}}\, g(\boldsymbol{p})\right|\_{\boldsymbol{p}=g^{-1}(X\_{i}\boldsymbol{\beta})}\right)\right)\_{1\leq i\leq n}^{-1},\tag{5.73}$$

and the working residuals

$$\boldsymbol{R}(\boldsymbol{Y},\boldsymbol{\beta}) = \left( \left( \left.\nabla\_{\boldsymbol{p}}\, g(\boldsymbol{p})\right|\_{\boldsymbol{p} = g^{-1}(X\_i \boldsymbol{\beta})} \right)^{\top} \left(T(Y\_i) - \boldsymbol{p}(\boldsymbol{x}\_i)\right) \right)\_{1 \le i \le n}. \tag{5.74}$$

Because we work with the canonical link $g = h$ and $g^{-1} = \nabla\_{\boldsymbol{\theta}}\, \kappa$, we can use the simplified block diagonal matrix

$$W(\boldsymbol{\beta}) = \text{diag}\left(\Sigma\_1(\boldsymbol{\beta}), \dots, \Sigma\_n(\boldsymbol{\beta})\right) \in \mathbb{R}^{kn \times kn},$$

and the working residuals

$$\boldsymbol{R}(\boldsymbol{Y},\boldsymbol{\beta}) = \left(\Sigma\_i(\boldsymbol{\beta})^{-1} \left(T(Y\_i) - \boldsymbol{p}(\boldsymbol{x}\_i)\right)\right)\_{1 \le i \le n} \in \mathbb{R}^{kn}.$$

Finally, we define the design matrix

$$\mathfrak{X} = \begin{pmatrix} X\_1 \\ X\_2 \\ \vdots \\ X\_n \end{pmatrix} \in \mathbb{R}^{kn \times k(q+1)}.$$

Putting everything together, we obtain the score equations

$$s(\boldsymbol{\beta}, \boldsymbol{Y}) = \nabla\_{\boldsymbol{\beta}} \ell\_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \mathfrak{X}^\top W(\boldsymbol{\beta})\, \boldsymbol{R}(\boldsymbol{Y}, \boldsymbol{\beta}) = 0. \tag{5.75}$$

This is now exactly of the same form as in Proposition 5.1. Fisher's scoring method/the IRLS algorithm then allows us to recursively calculate the MLE of $\boldsymbol{\beta} \in \mathbb{R}^{k(q+1)}$ by

$$
\widehat{\boldsymbol{\beta}}^{(t)} \mapsto \widehat{\boldsymbol{\beta}}^{(t+1)} = \left(\mathfrak{X}^{\top} W(\widehat{\boldsymbol{\beta}}^{(t)})\, \mathfrak{X}\right)^{-1} \mathfrak{X}^{\top} W(\widehat{\boldsymbol{\beta}}^{(t)}) \left(\mathfrak{X} \widehat{\boldsymbol{\beta}}^{(t)} + \boldsymbol{R}(\boldsymbol{Y}, \widehat{\boldsymbol{\beta}}^{(t)})\right).
$$
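Under the canonical link, the score reduces to $\mathfrak{X}^\top W(\boldsymbol{\beta})\boldsymbol{R}(\boldsymbol{Y},\boldsymbol{\beta}) = \sum_i X_i^\top(T(Y_i) - \boldsymbol{p}(\boldsymbol{x}_i))$, so the IRLS step above coincides with a Newton step $\boldsymbol{\beta} + \mathcal{I}_n(\boldsymbol{\beta})^{-1} s(\boldsymbol{\beta},\boldsymbol{Y})$. The following minimal Python sketch (function names and the synthetic data are ours, purely for illustration) implements this recursion for the categorical GLM:

```python
import numpy as np

def softmax_probs(beta, x):
    """Class probabilities (p_1(x), ..., p_k(x)) of the logistic GLM (5.69);
    the reference level k+1 has linear predictor 0."""
    eta = beta @ x                          # (<beta_1,x>, ..., <beta_k,x>)
    m = max(eta.max(), 0.0)                 # stabilize the exponentials
    e = np.exp(eta - m)
    return e / (np.exp(-m) + e.sum())

def irls_categorical(X, Y, k, n_iter=30):
    """Fisher's scoring / IRLS for the categorical GLM under the canonical
    link; X has shape (n, q+1) including the intercept column, Y takes
    values in {1, ..., k+1}; returns the MLE of beta, shape (k, q+1)."""
    n, d = X.shape
    beta = np.zeros((k, d))
    for _ in range(n_iter):
        score = np.zeros((k, d))
        fisher = np.zeros((k * d, k * d))
        for i in range(n):
            p = softmax_probs(beta, X[i])
            T = np.eye(k + 1)[Y[i] - 1][:k]        # one-hot T(Y_i)
            score += np.outer(T - p, X[i])         # X_i^T [T(Y_i) - p(x_i)]
            Sigma = np.diag(p) - np.outer(p, p)    # Sigma_i(beta)
            fisher += np.kron(Sigma, np.outer(X[i], X[i]))
        beta = beta + np.linalg.solve(fisher, score.ravel()).reshape(k, d)
    return beta

# synthetic example: k+1 = 3 labels, one real feature plus intercept
rng = np.random.default_rng(1)
n, k = 400, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([[0.3, 1.0], [-0.2, -0.8]])
Y = np.empty(n, dtype=int)
for i in range(n):
    p = softmax_probs(beta_true, X[i])
    Y[i] = rng.choice(k + 1, p=np.append(p, 1.0 - p.sum())) + 1
beta_hat = irls_categorical(X, Y, k)
```

At the fitted `beta_hat` the score equations (5.75) are satisfied numerically, which is the natural convergence check for the recursion.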

We have asymptotic normality of the MLE (under suitable regularity conditions)

$$
\widehat{\boldsymbol{\beta}}\_n^{\mathrm{MLE}} \overset{(\mathrm{d})}{\approx} \mathcal{N}\left(\boldsymbol{\beta}, \mathcal{I}\_n(\boldsymbol{\beta})^{-1}\right),
$$

for large sample sizes *n*. This allows us to apply the Wald test (5.32) for backward parameter elimination. Moreover, in-sample and out-of-sample losses can be analyzed with unit deviances coming from the categorical cross-entropy loss function (4.19).

*Remarks 5.32* The above derivations have been done for the categorical distribution under the canonical link choice. However, these considerations hold true for more general links $g$ within the vector-valued parameter EF. That is, the block diagonal matrix $W(\boldsymbol{\beta})$ in (5.73) and the working residuals $\boldsymbol{R}(\boldsymbol{Y}, \boldsymbol{\beta})$ in (5.74) provide the score equations (5.75) for general vector-valued parameter EF examples, where we replace the categorical probability $\boldsymbol{p}$ by the mean $\boldsymbol{\mu} = \mathbb{E}\_{\boldsymbol{\beta}}[T(\boldsymbol{Y})]$.

## **5.8 Further Topics of Regression Modeling**

There are several special topics and tools in regression modeling that we have not yet discussed. Some of them will be considered in selected chapters below, and some are mentioned here without going into detail.

## *5.8.1 Longitudinal Data and Random Effects*

The GLMs studied above consider cross-sectional data, meaning that we have fixed one time period $t$ and studied this time period in an isolated fashion. Time-dependent extensions are called longitudinal or panel data. Consider a time series of data $(Y\_{i,t}, \boldsymbol{x}\_{i,t})$ for policies $1 \le i \le n$ and time points $t \ge 1$. For the prediction of the response variable $Y\_{i,t}$ we may then regress on the individual past history of policy $i$, given by the data

$$\mathcal{D}\_{i,t} = \left\{ Y\_{i,1}, \dots, Y\_{i,t-1}, \boldsymbol{x}\_{i,1}, \dots, \boldsymbol{x}\_{i,t} \right\}.$$

In particular, we may explore the distribution of *Yi,t* , conditionally given *Di,t* ,

$$Y\_{i,t}|\_{\mathcal{D}\_{i,t}} \sim F(\cdot|\mathcal{D}\_{i,t};\theta),$$

for canonical parameter $\theta \in \boldsymbol{\Theta}$ and $F(\cdot|\mathcal{D}\_{i,t}; \theta)$ being a member of the EDF. For a GLM we choose a link function $g$ and make the assumption

$$g\left(\mathbb{E}\_{\boldsymbol{\beta}}[Y\_{i,t}|\mathcal{D}\_{i,t}]\right) = \langle\boldsymbol{\beta}, \boldsymbol{z}\_{i,t}\rangle,\tag{5.76}$$

where $\boldsymbol{z}\_{i,t} \in \mathbb{R}^{q+1}$ is a $(q+1)$-dimensional and $\sigma(\mathcal{D}\_{i,t})$-measurable feature vector, and regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$ describes the common systematic effects across all policies $1 \le i \le n$. This gives a generalized auto-regressive model, and if we have the Markov property

$$F(\cdot | \mathcal{D}\_{i,t}; \theta) \stackrel{(\mathrm{d})}{=} F(\cdot | Y\_{i,t-1}, \boldsymbol{x}\_{i,t}; \theta) \qquad \text{ for all } t \ge 2 \text{ and } \theta \in \boldsymbol{\Theta},$$

we obtain a generalized auto-regressive model of order 1. These longitudinal models allow one to model experience rating, for instance, in car insurance, where the past claims history directly influences future insurance prices; we refer to Remark 5.15 on bonus-malus systems (BMS).
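A minimal simulation sketch may clarify the order-1 structure; the Poisson distribution, the log-link and the feature choice $z_{i,t} = (1, Y_{i,t-1})^\top$ are our own illustrative assumptions, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar1_poisson(beta, n_policies, n_periods):
    """Sketch of a generalized auto-regressive model of order 1: a Poisson
    GLM with log-link whose sigma(D_{i,t})-measurable feature is
    z_{i,t} = (1, Y_{i,t-1}), so the conditional distribution of Y_{i,t}
    depends on the past only through last period's observation."""
    Y = np.zeros((n_policies, n_periods), dtype=int)
    for i in range(n_policies):
        y_prev = 0                               # no history before t = 1
        for t in range(n_periods):
            z = np.array([1.0, y_prev])          # D_{i,t}-measurable feature
            mu = np.exp(beta @ z)                # link g = log in (5.76)
            Y[i, t] = rng.poisson(mu)
            y_prev = Y[i, t]
    return Y

# portfolio of 2000 policies observed over 10 periods
claims = simulate_ar1_poisson(np.array([np.log(2.0), 0.1]), 2000, 10)
```

A positive second coefficient makes high past claim counts push the conditional mean up, which is exactly the experience-rating mechanism described above.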

The next level of complexity is obtained by extending the regression structure (5.76) by policy-$i$-specific random effects $\boldsymbol{B}\_i$, such that we may postulate

$$g\left(\mathbb{E}\_{\boldsymbol{\beta}}[Y\_{i,t}|\mathcal{D}\_{i,t},\boldsymbol{B}\_{i}]\right) = \langle\boldsymbol{\beta},\boldsymbol{z}\_{i,t}\rangle + \langle\boldsymbol{B}\_{i},\boldsymbol{w}\_{i,t}\rangle,\tag{5.77}$$

with $\sigma(\mathcal{D}\_{i,t})$-measurable feature vector $\boldsymbol{w}\_{i,t}$. The regression parameter $\boldsymbol{\beta}$ then describes the fixed systematic effects that are common over the entire portfolio $1 \le i \le n$, and $\boldsymbol{B}\_i$ describes the policy dependent random effects (assumed to be normalized, $\mathbb{E}[\boldsymbol{B}\_i] = 0$). Typically, one assumes that $\boldsymbol{B}\_1, \dots, \boldsymbol{B}\_n$ are centered and i.i.d. Such effects are called static random effects because they are not time-dependent, and they may also be interpreted in a Bayesian sense.

Finally, extending these static random effects to dynamic random effects *Bi,t* , *t* ≥ 1, leads to so-called state-space models, the linear state-space model being the most popular example and being fitted using the Kalman filter [207].

## *5.8.2 Regression Models Beyond the GLM Framework*

There are several ways in which the GLM framework can be modified.

#### **Siblings of Generalized Linear Regression Functions**

The most common modification of GLMs concerns the regression structure, namely, that the scalar product in the linear predictor

$$\boldsymbol{x} \mapsto g(\mu) = \eta = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle,$$

is replaced by another regression function. A popular alternative is the framework of generalized additive models (GAMs). GAMs go back to Hastie–Tibshirani [181, 182] and the standard reference is Wood [384]. GAMs consider the regression functions

$$\boldsymbol{x} \mapsto g(\mu) = \eta = \beta\_0 + \sum\_{j} \beta\_j\, s\_j(x\_j),\tag{5.78}$$

where $s\_j : \mathbb{R} \to \mathbb{R}$ are natural cubic splines. Natural cubic splines $s\_j$ are obtained by concatenating cubic functions at so-called nodes. A GAM can have as many nodes in each cubic spline $s\_j$ as there are different levels $x\_{i,j}$ in the data $1 \le i \le n$. In general, this leads to very flexible regression models, and regularization is applied to control in-sample over-fitting; for regularization we also refer to Sect. 6.2. Regularization requires setting a tuning parameter, and an efficient determination of this tuning parameter uses generalized cross-validation, see Sect. 5.6. Nevertheless, fitting GAMs can be computationally demanding; already for portfolios with 1 million policies and 20 feature components the calibration can be very slow. Moreover, regression function (5.78) does not (directly) allow for a data driven method of finding interactions between feature components. For these reasons, we do not further study GAMs in this monograph.

A modification of the regression function that is able to consider interactions between feature components is the framework of classification and regression trees (CARTs). CARTs have been introduced by Breiman et al. [54] in 1984, and they are still used in their original form today. Regression trees aim to partition the feature space $\mathcal{X}$ into a finite number of disjoint subsets $\mathcal{X}\_t$, $1 \le t \le T$, such that all policies $(Y\_i, \boldsymbol{x}\_i)$ in the same subset $\boldsymbol{x}\_i \in \mathcal{X}\_t$ satisfy a certain homogeneity property w.r.t. the regression task (and the chosen loss function). The CART regression function is then defined by

$$\mathbf{x} \mapsto \mu(\mathbf{x}) = \sum\_{t=1}^{T} \widehat{\mu}\_t \, \mathbb{1}\_{\{\mathbf{x} \in \mathcal{X}\_t\}},$$

where $\widehat{\mu}\_t$ is the homogeneous mean estimator on $\mathcal{X}\_t$. These CARTs are popular building blocks for ensemble methods in which different regression functions are combined; we mention random forests and boosting algorithms that mainly rely on CARTs. Random forests have been introduced by Breiman [52], and boosting has been popularized by Valiant [362], Kearns–Valiant [209, 210], Schapire [328], Freund [139] and Freund–Schapire [140]. Today, boosting belongs to the most powerful predictive regression methods; we mention the XGBoost algorithm of Chen–Guestrin [71] that has won many competitions. We will not further study CARTs and boosting in these notes because these methods also have some drawbacks. For instance, the resulting regression functions are not continuous, nor do they easily allow one to extrapolate beyond the (observed) feature space, e.g., if we have a time component. Moreover, they are more difficult to use with unstructured data such as text data. For more on CARTs and boosting in actuarial science we refer to Denuit et al. [100] and Ferrario–Hämmerli [125].
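A one-split regression tree (a "stump") already exhibits both ingredients of the CART regression function, the partition of $\mathcal{X}$ and the leaf means $\widehat{\mu}_t$. The sketch below (scalar feature, squared error loss, function names ours) finds the best single split point:

```python
import numpy as np

def fit_stump(x, y):
    """Minimal CART step: choose the single split point of a scalar
    feature minimizing the within-leaf squared error, and return it
    together with the two leaf means (the mu_t-hat of the text)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_sse, best_c = np.inf, None
    for j in range(1, len(xs)):
        if xs[j] == xs[j - 1]:
            continue                     # split only between distinct values
        left, right = ys[:j], ys[j:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_c = sse, (xs[j - 1] + xs[j]) / 2
    mu = (y[x <= best_c].mean(), y[x > best_c].mean())
    return best_c, mu

def predict_stump(c, mu, x):
    """Piecewise-constant CART regression function x -> mu(x)."""
    return np.where(x <= c, mu[0], mu[1])
```

Growing a full tree repeats this greedy search recursively inside each leaf; ensemble methods such as random forests and boosting then average or stage-wise combine many such piecewise-constant functions.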

#### **Other Distributional Models**

The theory above has relied on the EDF, but, of course, we could also study any other family of distribution functions. A clear drawback of the EDF is that it only considers light-tailed distribution functions, i.e., distribution functions for which the moment generating function exists around the origin. If the data is more heavy-tailed, one may need to transform this data and then use the EDF on the transformed data (with the drawback that one loses the balance property), or one chooses another family of distribution functions. Transformations have already been discussed in Remarks 2.11 and Sect. 5.3.9. Two other families of distributions that have been studied in the actuarial literature are the generalized beta of the second kind (GB2) distribution, see Venter [369], Frees et al. [137] and Chan et al. [66], and inhomogeneous phase type (IHP) distributions, see Albrecher et al. [8] and Bladt [37]. The GB2 family is a 4-parameter family, and it nests several examples such as the gamma, the Weibull, the Pareto and the Lomax distributions, see Table B1 in Chan et al. [66]. The density of the GB2 distribution is for $y > 0$ given by

$$f(\mathbf{y}; a, b, \alpha\_1, \alpha\_2) = \frac{\frac{|a|}{b} \left(\frac{\mathbf{y}}{b}\right)^{a\alpha\_1 - 1}}{B(\alpha\_1, \alpha\_2) \left(1 + \left(\frac{\mathbf{y}}{b}\right)^a\right)^{\alpha\_1 + \alpha\_2}} \tag{5.79}$$

$$= \frac{\frac{|a|}{y}}{B(\alpha\_1, \alpha\_2)} \left(\frac{\left(\frac{\mathbf{y}}{b}\right)^a}{1 + \left(\frac{\mathbf{y}}{b}\right)^a}\right)^{\alpha\_1} \left(\frac{1}{1 + \left(\frac{\mathbf{y}}{b}\right)^a}\right)^{\alpha\_2},$$

with scale parameter $b > 0$, shape parameters $a \in \mathbb{R}$ and $\alpha\_1, \alpha\_2 > 0$, and beta function

$$B(\alpha\_1, \alpha\_2) = \frac{\Gamma(\alpha\_1)\Gamma(\alpha\_2)}{\Gamma(\alpha\_1 + \alpha\_2)}.$$

Consider the modified logistic transformation of variable $y \mapsto z = (y/b)^a/(1 + (y/b)^a) \in (0,1)$. This gives us the beta density

$$f(z; \alpha\_1, \alpha\_2) = \frac{z^{\alpha\_1 - 1}(1 - z)^{\alpha\_2 - 1}}{B(\alpha\_1, \alpha\_2)}.$$

Thus, the GB2 distribution can be obtained by a transformation of the beta distribution. The latter provides that a GB2 distributed random variable $Y$ can be simulated from $Y \stackrel{(\mathrm{d})}{=} b\left(Z/(1-Z)\right)^{1/a}$ with $Z \sim \text{Beta}(\alpha\_1, \alpha\_2)$.
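This simulation recipe is easy to implement and to check numerically against the first-moment formula given below; the parameter values in the sketch are our own illustrative choices:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def beta_fn(x, y):
    """Beta function B(x, y) expressed via the gamma function."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

def sample_gb2(a, b, alpha1, alpha2, size):
    """Simulate GB2 via the beta representation Y = b (Z/(1-Z))^{1/a}
    with Z ~ Beta(alpha1, alpha2)."""
    Z = rng.beta(alpha1, alpha2, size)
    return b * (Z / (1.0 - Z)) ** (1.0 / a)

# illustration: tail index alpha2 * a = 6, so first two moments exist
a, b, a1, a2 = 2.0, 1.5, 2.0, 3.0
Y = sample_gb2(a, b, a1, a2, 200_000)
mean_theory = b * beta_fn(a1 + 1 / a, a2 - 1 / a) / beta_fn(a1, a2)
```

The Monte Carlo mean of `Y` matches `mean_theory` closely, which is a convenient sanity check of both the sampler and the moment formula.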

A GB2 distributed random variable *Y* has first moment

$$\mathbb{E}\_{a,b,\alpha\_1,\alpha\_2}[Y] \, = \, \frac{B(\alpha\_1 + 1/a, \alpha\_2 - 1/a)}{B(\alpha\_1, \alpha\_2)} \, b,$$

for $-\alpha\_1 a < 1 < \alpha\_2 a$. Observe that for $a > 0$ the survival function of $Y$ is regularly varying with tail index $\alpha\_2 a > 0$. Thus, we can model Pareto-like tails with the GB2 family; for regular variation we refer to (1.3).

As proposed in Frees et al. [137], one can introduce a regression structure for *b >* 0 by choosing a log-link and setting

$$\log\left(\mathbb{E}\_{a,b,\alpha\_1,\alpha\_2}[Y]\right) = \log\left(\frac{B(\alpha\_1 + 1/a, \alpha\_2 - 1/a)}{B(\alpha\_1, \alpha\_2)}\right) + \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle.$$

MLE of $\boldsymbol{\beta}$ may pose some challenges because it depends on the nuisance parameters $a, \alpha\_1, \alpha\_2$. In a recent paper, Li et al. [251] propose to extend this GB2 regression to a composite regression model; composite models are discussed in Sect. 6.4.4, below. This closes this short section; for more examples we refer to the literature.

## *5.8.3 Quantile Regression*

#### **Pinball Loss Function**

The GLMs introduced above aim at estimating the means $\mu(\boldsymbol{x}) = \mathbb{E}\_{\theta(\boldsymbol{x})}[Y]$ of random variables $Y$ being explained by features $\boldsymbol{x}$. Since mean estimation can be rather sensitive in situations where we have large claims, the more robust quantile regression has recently attracted some attention. Quantile regression has been introduced by Koenker–Bassett [220]. The idea is that instead of estimating the mean $\mu$ of a random variable $Y$, we rather try to estimate its $\tau$-quantile for given $\tau \in (0,1)$. The $\tau$-quantile is given by the generalized inverse $F^{-1}(\tau)$ of the distribution function $F$ of $Y$, that is,

$$F^{-1}(\tau) = \inf \left\{ y \in \mathbb{R};\ F(y) \ge \tau \right\}.\tag{5.80}$$

Consider the *pinball loss function* for $y \in \mathcal{C}$ (the convex closure of the support of $Y$) and actions $a \in \mathcal{A} = \mathbb{R}$

$$(y, a) \mapsto L\_\tau(y, a) = (y - a) \left( \tau - \mathbb{1}\_{\{y - a < 0\}} \right). \tag{5.81}$$

This provides us with the expected loss for $Y \sim F$ and action $a \in \mathcal{A}$

$$\mathbb{E}\_{F}\left[L\_\tau(Y,a)\right] = \mathbb{E}\_{F}\left[(Y-a)\left(\tau-\mathbb{1}\_{\{Y<a\}}\right)\right] = (\tau - 1)\int\_{-\infty}^{a} (y-a)\, dF(y) + \tau\int\_{a}^{\infty} (y-a)\, dF(y).$$

The aim is to find an optimal action *a(F )* that minimizes this expected loss, see (4.24),

$$
\widehat{a}(F) \in \mathfrak{A}(F) = \underset{a \in \mathcal{A}}{\arg\min}\ \mathbb{E}\_F \left[ L\_\tau(Y, a) \right].
$$

Note that for the time being we do not know whether the solution to this minimization problem is a singleton. For this reason, we state the solution (subject to existence) as a set-valued functional $\mathfrak{A}$, see (4.25).

We calculate the score equation of the expected loss using the Leibniz rule

$$\frac{\partial}{\partial a} \mathbb{E}\_F \left[ L\_\tau(Y, a) \right] = - (\tau - 1) \int\_{-\infty}^a dF(y) - \tau \int\_a^\infty dF(y)$$

$$= - (\tau - 1)\, F(a) - \tau \left( 1 - F(a) \right) = F(a) - \tau \stackrel{!}{=} 0.$$

Assume the distribution $F$ is continuous. This implies $F(F^{-1}(\tau)) = \tau$, and we have

$$F^{-1}(\tau) \in \mathfrak{A}(F) = \underset{a \in \mathcal{A}}{\arg\min}\ \mathbb{E}\_F \left[ L\_\tau(Y, a) \right].$$

In fact, using the pinball loss, we have just seen that the *τ* -quantile is elicitable within the class of continuous distributions, see Definition 4.18.
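This elicitability statement can be checked empirically: minimizing the sample average of the pinball loss over actions $a$ recovers the empirical $\tau$-quantile. A small sketch (the lognormal sample, grid and quantile level are arbitrary choices of ours):

```python
import numpy as np

def pinball(y, a, tau):
    """Pinball loss L_tau(y, a) of (5.81), vectorized over observations y."""
    return (y - a) * (tau - (y - a < 0))

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)
tau = 0.9

# minimize the empirical expected pinball loss over a grid of actions a
grid = np.linspace(0.0, 10.0, 2_001)
losses = np.array([pinball(y, a, tau).mean() for a in grid])
a_star = grid[losses.argmin()]  # close to the empirical 90% quantile of y
```

The minimizer `a_star` agrees with `np.quantile(y, 0.9)` up to the grid resolution, in line with the score equation $F(a) - \tau = 0$ derived above.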

For a more general result we need a more general definition of a (set-valued) *τ* -quantile

$$\mathcal{Q}\_\tau(F) = \left\{ y \in \mathbb{R};\ \lim\_{z \uparrow y} F(z) \le \tau \le F(y) \right\}. \tag{5.82}$$

This defines a closed interval whose lower endpoint corresponds to the generalized inverse $F^{-1}(\tau)$ given in (5.80). In complete analogy to Theorem 4.19 on the elicitability of the mean functional, we have the following statement for the $\tau$-quantile; this result goes back to Thomson [351] and Saerens [326].

**Theorem 5.33 (Gneiting [162, Theorem 9], Without Proof)** *Let $\mathcal{F}$ be the class of distribution functions on an interval $\mathcal{C} \subseteq \mathbb{R}$ and choose quantile level $\tau \in (0,1)$.*

• *The $\tau$-quantile* (5.82) *is elicitable relative to $\mathcal{F}$.*

• *Assume the loss function $L : \mathcal{C} \times \mathcal{A} \to \mathbb{R}\_+$ satisfies (L0)–(L2) on page 92 for an interval $\mathcal{C} = \mathcal{A} \subseteq \mathbb{R}$. $L$ is consistent for the $\tau$-quantile* (5.82) *relative to the class $\mathcal{F}$ of compactly supported distributions on $\mathcal{C}$ if and only if $L$ is of the form*

$$L(y, a) = (G(y) - G(a)) \left( \tau - \mathbb{1}\_{\{y - a < 0\}} \right),$$

*for a non-decreasing function $G$ on $\mathcal{C}$.*

• *If $G$ is strictly increasing on $\mathcal{C}$ and if $\mathbb{E}\_F[G(Y)]$ exists and is finite for all $F \in \mathcal{F}$, then the above loss function $L$ is strictly consistent for the $\tau$-quantile* (5.82) *relative to the class $\mathcal{F}$.*

Theorem 5.33 characterizes the strictly consistent loss functions for quantile estimation, the pinball loss being the special case *G(y)* = *y*.

#### **Quantile Regression**

The idea behind quantile regression is that we build a regression model for the $\tau$-quantile. Assume we have a datum $(Y, \boldsymbol{x})$ whose conditional $\tau$-quantile, given $\boldsymbol{x} \in \{1\} \times \mathbb{R}^q$, can be described by the regression function

$$\boldsymbol{x} \mapsto g\left(F\_{Y|\boldsymbol{x}}^{-1}(\tau)\right) = \langle \boldsymbol{\beta}\_\tau, \boldsymbol{x} \rangle,$$

for a strictly monotone and smooth link function $g : \mathcal{C} \to \mathbb{R}$, and for a regression parameter $\boldsymbol{\beta}\_\tau \in \mathbb{R}^{q+1}$. The aim now is to estimate this regression parameter from independent data $(Y\_i, \boldsymbol{x}\_i)$, $1 \le i \le n$. The pinball loss $L\_\tau$, given in (5.81), provides us with the following optimization problem

$$\widehat{\boldsymbol{\beta}}\_\tau := \underset{\boldsymbol{\beta} \in \mathbb{R}^{q+1}}{\arg\min} \sum\_{i=1}^{n} L\_\tau \left( Y\_i, g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle \right).$$

This then allows us to estimate the corresponding $\tau$-quantile as a function of the feature information $\boldsymbol{x}$. For $\tau = 1/2$ we estimate the median by

$$
\widehat{F}\_{Y|\mathbf{x}}^{-1}(1/2) = \mathbf{g}^{-1}\left<\widehat{\boldsymbol{\beta}}\_{1/2}, \mathbf{x}\right> .
$$
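For the identity link, the estimator $\widehat{\boldsymbol{\beta}}_\tau$ can be computed by direct numerical minimization of the empirical pinball objective. The sketch below (our own illustration; dedicated implementations instead solve a linear program, e.g., R's `quantreg`) uses a general-purpose optimizer started at the least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

def pinball(y, a, tau):
    """Pinball loss L_tau(y, a) of (5.81)."""
    return (y - a) * (tau - (y - a < 0))

def fit_quantile_reg(X, y, tau):
    """Estimate beta_tau by minimizing the empirical pinball loss for the
    identity link g(x) = x; Nelder-Mead is used since the objective is
    piecewise linear and hence non-smooth."""
    obj = lambda beta: pinball(y, X @ beta, tau).sum()
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares warm start
    res = minimize(obj, beta0, method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 2_000})
    return res.x

# median regression (tau = 1/2) on synthetic data with Laplace noise
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.laplace(scale=0.3, size=n)
beta_med = fit_quantile_reg(X, y, tau=0.5)
```

For $\tau = 1/2$ this recovers the intercept and slope of the conditional median, since the Laplace noise is symmetric around zero.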

We conclude from this short section that we can regress any quantity $a(F)$ that is elicitable, i.e., for which a loss function exists that is strictly consistent for $a(F)$ on $F \in \mathcal{F}$. For more on quantile regression we refer to the monograph of Uribe–Guillén [361], and an interesting paper is Dimitriades et al. [106]. We will study quantile regression within deep networks in Sect. 11.2, below.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 6 Bayesian Methods, Regularization and Expectation-Maximization**

The previous chapter has been focusing on MLE of regression parameters within GLMs. Alternatively, we could address the parameter estimation problem within a Bayesian setting. The purpose of this chapter is to discuss the Bayesian estimation approach. This leads us to the notion of regularization within GLMs. Bayesian methods are also used in the Expectation-Maximization (EM) algorithm for MLE in the case of incomplete data. For literature on Bayesian theory we recommend Gelman et al. [157], Congdon [79], Robert [319], Bühlmann–Gisler [58] and Gilks et al. [158]. A nice historical (non-mathematical) review of Bayesian methods is presented in McGrayne [266]. Regularization is discussed in the book of Hastie et al. [184], and a good reference for the EM algorithm is McLachlan–Krishnan [267].

## **6.1 Bayesian Parameter Estimation**

The Bayesian estimator has been introduced in Definition 3.6. Assume that the observation $\boldsymbol{Y}$ has independent components $Y\_i$ that can be described by a GLM with link function $g$ and regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$, i.e., the random variables $Y\_i$ have densities

$$Y\_i \stackrel{\text{ind.}}{\sim} f(y; \boldsymbol{\beta}, \boldsymbol{x}\_i, v\_i/\varphi) = \exp\left\{ \frac{y\, (h \circ g^{-1}) \langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle - (\kappa \circ h \circ g^{-1}) \langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle}{\varphi/v\_i} + a(y; v\_i/\varphi) \right\}, \quad i = 1, \ldots, n,$$

with canonical link $h = (\kappa')^{-1}$. In a Bayesian approach one models the regression parameter $\boldsymbol{\beta}$ with a prior distribution<sup>1</sup> $\pi(\boldsymbol{\beta})$ on the parameter space $\mathbb{R}^{q+1}$, and the independence assumption between the components of $\boldsymbol{Y}$ needs to be understood

<sup>1</sup> Often, in Bayesian arguments, distribution and density are used in an interchangeable (and not fully precise) way, and it is left to the reader to give the right meaning to $\pi$.

M. V. Wüthrich, M. Merz, *Statistical Foundations of Actuarial Learning and its Applications*, Springer Actuarial, https://doi.org/10.1007/978-3-031-12409-9\_6

conditionally, given the regression parameter *β*. In other words, all observations *Yi* share the same regression parameter *β*, which itself is modeled by a prior distribution *π*.

The joint density of *Y* and *β* is given by

$$p(\boldsymbol{y}, \boldsymbol{\beta}) = \left(\prod\_{i=1}^{n} f(y\_i; \boldsymbol{\beta}, \boldsymbol{x}\_i, v\_i/\varphi)\right) \pi(\boldsymbol{\beta}) = \exp\left\{\ell\_{\boldsymbol{Y}=\boldsymbol{y}}(\boldsymbol{\beta}) + \log \pi(\boldsymbol{\beta})\right\}.\tag{6.1}$$

For the given observation *Y*, this allows us to calculate the posterior density of *β* using Bayes' rule

$$\pi(\boldsymbol{\beta}|\boldsymbol{Y}) = \frac{p(\boldsymbol{Y},\boldsymbol{\beta})}{\int p(\boldsymbol{Y},\widetilde{\boldsymbol{\beta}})\, d\widetilde{\boldsymbol{\beta}}} \propto \left(\prod\_{i=1}^{n} f(Y\_i; \boldsymbol{\beta}, \boldsymbol{x}\_i, v\_i/\varphi)\right) \pi(\boldsymbol{\beta}),\tag{6.2}$$

where the proportionality sign ∝ indicates that we have dropped the terms that do not depend on *β*. Thus, the functional form in *β* of the posterior density *π(β*|*Y)* is fully determined by the joint density *p(Y, β)*, and the remaining term is a normalization to obtain a proper probability distribution. In many situations, the knowledge of the functional form of the posterior density in *β* is sufficient to perform Bayesian parameter estimation, at least, numerically. We will give some references, below.

The Bayesian estimator for *β* is given by the posterior mean (supposed it exists)

$$\widehat{\boldsymbol{\beta}}^{\text{Bayes}} = \mathbb{E}\_{\pi}\left[\boldsymbol{\beta} \,\middle|\, \boldsymbol{Y}\right] = \int \boldsymbol{\beta}\, \pi(\boldsymbol{\beta}|\boldsymbol{Y})\, d\boldsymbol{\beta}.$$

If we want to calculate the expectation of a new random variable $Y\_{n+1}$ that is conditionally, given $\boldsymbol{\beta}$, independent of $\boldsymbol{Y}$ and follows the same GLM as $\boldsymbol{Y}$, we can directly calculate, using the tower property and conditional independence,<sup>2</sup>

$$\begin{aligned} \mathbb{E}\_{\pi}\left[\left. Y\_{n+1} \right| \boldsymbol{Y}\right] &= \mathbb{E}\_{\pi}\left[\left.\mathbb{E}\left[\left. Y\_{n+1} \right| \boldsymbol{\beta}, \boldsymbol{Y}\right]\right| \boldsymbol{Y}\right] = \mathbb{E}\_{\pi}\left[\left.\mathbb{E}\left[\left. Y\_{n+1} \right| \boldsymbol{\beta}\right]\right| \boldsymbol{Y}\right] \\ &= \mathbb{E}\_{\pi}\left[\left. g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}\_{n+1} \rangle \right| \boldsymbol{Y}\right] = \int g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}\_{n+1} \rangle\, \pi(\boldsymbol{\beta}|\boldsymbol{Y})\, d\boldsymbol{\beta}, \end{aligned}$$

provided that this first moment exists and that $\boldsymbol{x}\_{n+1}$ is the feature of $Y\_{n+1}$. We see that it all boils down to having sufficiently explicit knowledge of the posterior density $\pi(\boldsymbol{\beta}|\boldsymbol{Y})$ given in (6.2).

*Remark 6.1 (Conditional MSEP)* Based on the assumption that the posterior distribution $\pi(\boldsymbol{\beta}|\boldsymbol{Y})$ can be determined, we can analyze the GLM. In a Bayesian setup one

<sup>2</sup> Note that we identify probabilities $\mathbb{P}\_{\boldsymbol{\beta}}[\cdot] = \mathbb{P}[\cdot|\boldsymbol{\beta}]$ for given $\boldsymbol{\beta}$.

usually does not calculate the MSEP as described in Theorem 4.1, but one rather studies the conditional MSEP, conditioned exactly on the collected information *Y*. That is,

$$\begin{split} \mathbb{E}\_{\pi}\left[\left.\left( Y\_{n+1} - \mathbb{E}\_{\pi}\left[\left. Y\_{n+1}\right| \boldsymbol{Y}\right]\right)^{2}\right| \boldsymbol{Y}\right] &= \operatorname{Var}\_{\pi}\left(\left. Y\_{n+1}\right| \boldsymbol{Y}\right) \\ &= \operatorname{Var}\_{\pi}\left(\left.\mathbb{E}\left[\left. Y\_{n+1}\right| \boldsymbol{\beta}, \boldsymbol{Y}\right]\right| \boldsymbol{Y}\right) + \mathbb{E}\_{\pi}\left[\left.\operatorname{Var}\left(\left. Y\_{n+1}\right| \boldsymbol{\beta}, \boldsymbol{Y}\right)\right| \boldsymbol{Y}\right] \\ &= \operatorname{Var}\_{\pi}\left(\left. g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}\_{n+1} \rangle\right| \boldsymbol{Y}\right) + \frac{\varphi}{v\_{n+1}}\, \mathbb{E}\_{\pi}\left[\left.\left(\kappa'' \circ h \circ g^{-1}\right)\langle \boldsymbol{\beta}, \boldsymbol{x}\_{n+1} \rangle\right| \boldsymbol{Y}\right] \\ &= \operatorname{Var}\_{\pi}\left(\left. g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}\_{n+1} \rangle\right| \boldsymbol{Y}\right) + \frac{\varphi}{v\_{n+1}}\, \mathbb{E}\_{\pi}\left[\left. V\!\left( g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}\_{n+1} \rangle\right)\right| \boldsymbol{Y}\right], \end{split}$$

where we need to assume existence of second moments. Similar to Theorem 4.1, the first term is the estimation variance (in a Bayesian setting) and the second term is the average process variance (using the EDF variance function *μ* → *V (μ)*).

The remaining difficulty is the calculation of posterior expectations of functions of $\boldsymbol{\beta}$, based on the posterior density (6.2). In very well-designed experiments the posterior density $\pi(\boldsymbol{\beta}|\boldsymbol{Y})$ can be determined explicitly, for instance, in the homogeneous EDF case with so-called conjugate priors, see Chapter 2 in Bühlmann–Gisler [58]. But in most cases, there is no closed form solution for the posterior distribution. Major progress in Bayesian modeling has been made with the emergence of computational methods like the Markov chain Monte Carlo (MCMC) method, Gibbs sampling, the Metropolis–Hastings (MH) algorithm [185, 274], sequential Monte Carlo (SMC) sampling, non-linear particle filters, and the Hamiltonian Monte Carlo (HMC) algorithm. These methods help us to empirically approximate the posterior density $\pi(\boldsymbol{\beta}|\boldsymbol{Y})$ in different modeling setups. These methods have in common that explicit knowledge of the normalizing constant in (6.2) is not necessary; it suffices to know the functional form in $\boldsymbol{\beta}$ of the posterior density $\pi(\boldsymbol{\beta}|\boldsymbol{Y})$.

For a detailed description of MCMC methods in general, which includes Gibbs sampling and MH algorithms, we refer to Gilks et al. [158], Green [169, 170], Johansen et al. [199]; SMC sampling and non-linear particle filters are explained in Del Moral et al. [92, 93], Johansen–Evers [199], Doucet–Johansen [111], Creal [85] and Wüthrich [389]; the HMC algorithm is described in Neal [281]. We do not present these algorithms here, but for the description of the most popular algorithms we refer to Section 4.4 in Wüthrich–Buser [392]. The reason for not presenting these algorithms here is that they still face the curse of dimensionality, which makes it difficult to use Bayesian methods for high-dimensional data sets in large models; we provide another short discussion in Sect. 11.6.3, below.
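To illustrate how such samplers use only the unnormalized posterior, here is a random-walk Metropolis–Hastings sketch (our own minimal one-dimensional implementation, not taken from the references above), validated on a conjugate normal model where the posterior is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_post, theta0, n_samples, step=0.5):
    """Random-walk Metropolis-Hastings for a scalar parameter: only the
    posterior log-density up to its normalizing constant is needed,
    exactly as in (6.2)."""
    theta, lp = theta0, log_post(theta0)
    chain = np.empty(n_samples)
    for s in range(n_samples):
        prop = theta + step * rng.normal()          # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:    # accept/reject step
            theta, lp = prop, lp_prop
        chain[s] = theta
    return chain

# conjugate check: Y_i | theta ~ N(theta, 1) with prior theta ~ N(0, 1)
# gives the closed-form posterior N(sum(Y)/(n+1), 1/(n+1))
Y = rng.normal(loc=1.0, size=20)
log_post = lambda th: -0.5 * ((Y - th) ** 2).sum() - 0.5 * th ** 2
samples = metropolis_hastings(log_post, 0.0, 20_000)[5_000:]  # drop burn-in
```

The empirical mean and standard deviation of the retained chain match the closed-form posterior, which is the standard way to validate such a sampler before applying it where no closed form exists.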

## **6.2 Regularization**

## *6.2.1 Maximum a Posteriori Estimator*

In the previous section we have proposed to approximate the posterior density *π(β*|*Y)* of the regression parameter *β*, given *Y*, using MCMC methods. The posterior log-likelihood in the Bayesian GLM is given by, see (6.2),

$$\begin{aligned} \log \pi(\boldsymbol{\beta}|\boldsymbol{Y}) & \propto \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) + \log \pi(\boldsymbol{\beta})\\ & \propto \sum_{i=1}^{n} \frac{Y_{i}\,(h \circ g^{-1})\langle \boldsymbol{\beta}, \boldsymbol{x}_{i}\rangle - (\kappa \circ h \circ g^{-1})\langle \boldsymbol{\beta}, \boldsymbol{x}_{i}\rangle}{\varphi / v_{i}} + \log \pi(\boldsymbol{\beta}). \end{aligned}$$

Compared to the classical log-likelihood function *ℓY (β)* used for MLE, there is an additional log-density term log *π(β)* that comes from the prior distribution of *β*. Thus, the posterior log-likelihood balances the log-likelihood *ℓY (β)* of the data *Y* against the prior log-density log *π(β)* of the regression parameter *β*. We interpret this as *regularization* because the prior *π* smooths extremes in the log-likelihood of the observation *Y*. This gives rise to estimating the regression parameter *β* by the so-called maximum a posteriori (MAP) estimator

$$\widehat{\boldsymbol{\beta}}^{\text{MAP}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^{q+1}}{\arg\max}\, \log \pi(\boldsymbol{\beta}|\boldsymbol{Y}) = \underset{\boldsymbol{\beta} \in \mathbb{R}^{q+1}}{\arg\max}\, \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) + \log \pi(\boldsymbol{\beta}). \tag{6.3}$$

This *π*-regularized (MAP) parameter estimation has gained much popularity because it is a useful tool to prevent the model from over-fitting under suitable prior choices. Moreover, under specific choices, it allows for parameter selection. This is especially useful in high-dimensional problems; we refer to Hastie et al. [184].

Popular choices for *π* are prior densities coming from *Lp*-norms for some *p* ≥ 1, that is, $\pi(\boldsymbol{\beta}) \propto \exp\{-\lambda \|\boldsymbol{\beta}\|_p^p\}$ for *λ* > 0. Optimization problem (6.3) then becomes

$$\widehat{\boldsymbol{\beta}}^{\text{MAP}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^{q+1}}{\arg\max}\, \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) - \lambda \|\boldsymbol{\beta}\|_p^p,$$

for a fixed *regularization parameter λ* > 0 (also called tuning parameter). In practical applications we should exclude the intercept parameter *β*0 ∈ R from regularization: if we work with the canonical link within the GLM framework, we have the balance property which implies unbiasedness, see Corollary 5.7. This property gets lost if *β*0 is included in the regularization term. For this reason, we set $\boldsymbol{\beta}_- = (\beta_1, \ldots, \beta_q)^\top \in \mathbb{R}^q$ and we let regularization only act on these components

$$\widehat{\boldsymbol{\beta}}^{\text{MAP}} = \widehat{\boldsymbol{\beta}}^{\text{MAP}}(\lambda) = \underset{\boldsymbol{\beta} \in \mathbb{R}^{q+1}}{\arg\max}\, \frac{1}{n}\ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) - \lambda \|\boldsymbol{\beta}_-\|_p^p, \tag{6.4}$$

where we also scale with the sample size *n* to make the units of the tuning parameter *λ* independent of the sample size *n*.

#### *Remarks 6.2*

• The penalty in (6.4) treats all components of *β*− equally, which is only sensible if all feature components live on a comparable scale. If the *j* -th component lives on the scale *tj* > 0, this can be corrected for by scaling the corresponding penalty term, i.e., by considering

$$\widehat{\boldsymbol{\beta}}^{\text{MAP}} = \underset{\boldsymbol{\beta}}{\arg\max}\ \frac{1}{n}\ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) - \lambda \sum_{j=1}^{q} t_j^{-p}\,|\beta_j|^p.$$

• Often, the features have a natural group structure *x* = (*x*0, *x*1, …, *x*K), for instance, $\boldsymbol{x}_k \in \{0,1\}^{q_k}$ may represent the dummy coding of a categorical feature component with *qk* + 1 levels. In that case regularization should act equally on all components of $\boldsymbol{\beta}_k \in \mathbb{R}^{q_k}$ (those corresponding to *x*k) because these components describe the same systematic effect. Yuan–Lin [398] proposed for this problem grouped penalties of the form

$$\widehat{\boldsymbol{\beta}}^{\text{MAP}} = \underset{\boldsymbol{\beta}}{\arg\max}\ \frac{1}{n}\ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) - \lambda \sum_{k=1}^{K} \|\boldsymbol{\beta}_k\|_2. \tag{6.5}$$

This proposal leads to sparsity, i.e., for large regularization parameters *λ* the entire *βk* may be shrunk (exactly) to zero; this is discussed in Sect. 6.2.5, below. We also refer to Section 4.3 in Hastie et al. [184]; Devriendt et al. [104] introduced this approach to the actuarial literature.

• There are further versions of regularization, e.g., the fused LASSO approach ensures that the first differences *βj* − *βj*−1 remain small.
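To see why the grouped penalty (6.5) can shrink an entire block *βk* to zero, it is instructive to look at the corresponding block-wise soft-thresholding (the proximal operator of the group penalty, a standard closed form not spelled out in the text): a block is set to zero exactly when its Euclidean norm falls below the threshold. A minimal Python sketch, with the function name `group_soft_threshold` being our own:

```python
import math

def group_soft_threshold(z, lam):
    """Proximal operator of lam * ||z||_2 (grouped penalty):
    shrinks the whole block z towards zero, and sets it exactly
    to zero when ||z||_2 <= lam."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam:
        return [0.0 for _ in z]
    factor = 1.0 - lam / norm
    return [factor * v for v in z]

# a block with small norm is eliminated entirely
print(group_soft_threshold([0.3, 0.4], lam=1.0))   # -> [0.0, 0.0]
# a block with large norm is only shrunk towards zero (factor 0.8 here)
print(group_soft_threshold([3.0, 4.0], lam=1.0))
```

This block-wise behavior is exactly the grouped sparsity described above: either the whole group survives (shrunk) or the whole group is dropped.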

Our motivation for considering regularization has been inspired by Bayesian theory, but we can also come from a completely different angle, namely, we can consider a constrained optimization problem with a given budget constraint *c* > 0. That is, we can consider

$$\underset{\boldsymbol{\beta}\in\mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n}\ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) \qquad \text{subject to } \|\boldsymbol{\beta}_-\|_p^p \le c. \tag{6.6}$$

This optimization problem can be tackled by the method of Karush, Kuhn and Tucker (KKT) [208, 228]. Optimization problem (6.4) corresponds by Lagrangian duality to the constrained optimization problem (6.6). For every *c* for which the budget constraint in (6.6) is binding, i.e., $\|\boldsymbol{\beta}_-\|_p^p = c$, there is a corresponding regularization parameter *λ* = *λ(c)*, and, conversely, the solution of (6.4) solves (6.6) with $c = \|\widehat{\boldsymbol{\beta}}_-^{\text{MAP}}(\lambda)\|_p^p$.

## *6.2.2 Ridge vs. LASSO Regularization*

We compare the two special cases *p* = 1 and *p* = 2 in this section, and in the subsequent Sects. 6.2.3 and 6.2.4 we discuss how these two cases can be solved numerically.

**Ridge Regularization** *p* = 2 For *p* = 2, the prior distribution *π* in (6.4) is a centered Gaussian distribution. This *L*2-regularization is called *ridge regularization* or Tikhonov regularization [353], and we have

$$\widehat{\boldsymbol{\beta}}^{\text{ridge}} = \widehat{\boldsymbol{\beta}}^{\text{ridge}}(\lambda) = \underset{\boldsymbol{\beta} \in \mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n}\ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) - \lambda \sum_{j=1}^{q}\beta_j^2. \tag{6.7}$$

**LASSO Regularization** *p* = 1 For *p* = 1, the prior distribution *π* in (6.4) is a Laplace distribution. This *L*1-regularization is called *LASSO regularization* (least absolute shrinkage and selection operator), see Tibshirani [352], and we have

$$\widehat{\boldsymbol{\beta}}^{\text{LASSO}} = \widehat{\boldsymbol{\beta}}^{\text{LASSO}}(\lambda) = \underset{\boldsymbol{\beta} \in \mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n}\ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) - \lambda \sum_{j=1}^{q}|\beta_j|. \tag{6.8}$$

LASSO regularization has the advantage that it shrinks (unimportant) regression components to exactly zero, i.e., LASSO regularization can be used for parameter elimination and model reduction. This is discussed in the next paragraphs.

**Ridge vs. LASSO Regularization** Ridge (*p* = 2) and LASSO (*p* = 1) regularization behave rather differently. This can be understood best by using the budget constraint interpretation (6.6), which gives us a nice geometric illustration. The crucial part is that the side constraint gives us either the budget constraint $\|\boldsymbol{\beta}_-\|_2^2 = \sum_{j=1}^q \beta_j^2 \le c$ (squared Euclidean norm) or $\|\boldsymbol{\beta}_-\|_1 = \sum_{j=1}^q |\beta_j| \le c$ (Manhattan norm). In Fig. 6.1 we illustrate these two cases: the left-hand side shows the Euclidean ball in blue color (in two dimensions) and the right-hand side shows the corresponding Manhattan square in blue color; this figure is similar to Figure 2.2 in Hastie et al. [184].

The (unconstrained) MLE $\widehat{\boldsymbol{\beta}}^{\text{MLE}}$ is illustrated by the red dot in Fig. 6.1. If the red dot lay within the blue area, the budget constraint would not be binding. In Fig. 6.1 the red dot (MLE) does not lie within the blue budget constraint, and we need to compromise on the optimality of the MLE. Assume that the log-likelihood *β* → *ℓY (β)* is a concave function in *β*; then we receive convex level sets $\{\boldsymbol{\beta};\, \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) \ge \gamma_0\}$ around the MLE $\widehat{\boldsymbol{\beta}}^{\text{MLE}}$. The critical constant *γ*0 for which this level set is tangential to the blue budget constraint exactly gives us the solution to (6.6); this solution corresponds to the yellow dots in Fig. 6.1. The crucial difference between ridge and LASSO regularization is that in the latter case the yellow dot will eventually sit in a corner of the Manhattan square if we shrink the budget constraint *c* to zero. In other words, some of the components of *β* are set exactly equal to zero for small *c* or large *λ*, respectively; in Fig. 6.1 (rhs) this happens to the first component of $\widehat{\boldsymbol{\beta}}^{\text{LASSO}}$ (under the given budget constraint *c*). In ridge regularization this is not the case, except for special situations concerning the position of the red MLE. Thus, ridge regression makes the components of parameter estimates generally smaller, whereas LASSO shrinks some of these components exactly to zero (this also explains the name LASSO).

**Fig. 6.1** Illustration of optimization problem (6.6) under a budget constraint (lhs) for *p* = 2 (Euclidean norm) and (rhs) *p* = 1 (Manhattan norm)

*Remark 6.3 (Elastic Net)* LASSO regularization faces difficulties with collinearity in feature components. In particular, if we have a group of highly correlated feature components, LASSO fails to do a grouped selection, but it selects one component and ignores the others. On the other hand, ridge regularization can deal with this issue. For this reason, Zou–Hastie [409] proposed the *elastic net regularization*, which uses a combined regularization term

$$\widehat{\boldsymbol{\beta}}^{\text{elastic net}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n}\ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) - \lambda\left[(1-\alpha)\|\boldsymbol{\beta}\|_2^2 + \alpha\|\boldsymbol{\beta}\|_1\right],$$

for some *α* ∈ (0, 1). The *L*1-term gives sparsity and the quadratic term removes the limitation on the number of selected variables, providing a grouped selection. In Fig. 6.2 we compare the elastic net regularization (orange color) to ridge and LASSO regularization (black and blue color). Ridge regularization provides a smooth strictly convex boundary (black), whereas LASSO provides a boundary that is non-differentiable in the corners (blue). The elastic net is still non-differentiable in the corners, which is needed for variable selection, and at the same time it is strictly convex between the corners, which is needed for grouping.
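The interplay of the two penalty terms can also be seen from the one-dimensional proximal operator of the elastic net penalty, which has a standard closed form (our own derivation-check, not from the text): the *L*1-part soft-thresholds (variable selection) and the quadratic part additionally rescales (shrinkage). A Python sketch:

```python
def elastic_net_prox(z, lam, alpha):
    """Proximal operator of beta -> lam * ((1 - alpha) * beta**2 + alpha * |beta|):
    soft-thresholding at lam * alpha, followed by the shrinkage factor
    1 / (1 + 2 * lam * (1 - alpha))."""
    thr = lam * alpha
    if z > thr:
        s = z - thr
    elif z < -thr:
        s = z + thr
    else:
        s = 0.0
    return s / (1.0 + 2.0 * lam * (1.0 - alpha))

# alpha = 1 recovers pure LASSO soft-thresholding,
# alpha = 0 recovers pure ridge shrinkage z / (1 + 2 * lam)
print(elastic_net_prox(3.0, lam=1.0, alpha=1.0))   # -> 2.0
print(elastic_net_prox(3.0, lam=1.0, alpha=0.0))   # -> 1.0
print(elastic_net_prox(3.0, lam=1.0, alpha=0.5))   # -> 1.25
```

For intermediate *α* both mechanisms act: small coefficients are still set exactly to zero, while the surviving ones receive an additional ridge-type shrinkage.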

## *6.2.3 Ridge Regression*

In this section we consider ridge regression (*p* = 2) in more detail and we provide an example. The ridge estimator $\widehat{\boldsymbol{\beta}}^{\text{ridge}}$ in (6.7) is found by solving the score equations

$$\widetilde{s}(\boldsymbol{\beta},\boldsymbol{Y}) = \nabla_{\boldsymbol{\beta}}\left(\ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) - n\lambda\|\boldsymbol{\beta}_-\|_2^2\right) = \mathfrak{X}^{\top}W(\boldsymbol{\beta})\,\boldsymbol{R}(\boldsymbol{Y},\boldsymbol{\beta}) - 2n\lambda\boldsymbol{\beta}_- = \boldsymbol{0}, \tag{6.9}$$

note that we exclude the intercept *β*<sup>0</sup> from regularization (we use a slight abuse of notation, here), and we also refer to Proposition 5.1. The negative expected Hessian of this optimization problem is given by

$$\widetilde{\mathcal{I}}(\boldsymbol{\beta}) = -\mathbb{E}_{\boldsymbol{\beta}}\left[\nabla_{\boldsymbol{\beta}}^{2}\left(\ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) - n\lambda\|\boldsymbol{\beta}_-\|_2^2\right)\right] = \mathcal{I}(\boldsymbol{\beta}) + 2n\lambda\,\mathrm{diag}(0,1,\ldots,1) \in \mathbb{R}^{(q+1)\times(q+1)},$$

where $\mathcal{I}(\boldsymbol{\beta}) = \mathfrak{X}^{\top}W(\boldsymbol{\beta})\mathfrak{X}$ is Fisher's information matrix of the unconstrained MLE problem. This provides us with Fisher's scoring updates for *t* ≥ 0, see (5.13),

$$
\widehat{\boldsymbol{\beta}}^{(t)} \mapsto \widehat{\boldsymbol{\beta}}^{(t+1)} = \widehat{\boldsymbol{\beta}}^{(t)} + \widetilde{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}\, \widetilde{s}(\widehat{\boldsymbol{\beta}}^{(t)}, \boldsymbol{Y}). \tag{6.10}
$$

**Lemma 6.4** *Fisher's scoring update* (6.10) *can be rewritten as follows*

$$
\widehat{\boldsymbol{\beta}}^{(t)} \mapsto \widehat{\boldsymbol{\beta}}^{(t+1)} = \widetilde{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}\,\mathfrak{X}^{\top}W(\widehat{\boldsymbol{\beta}}^{(t)})\left(\mathfrak{X}\widehat{\boldsymbol{\beta}}^{(t)} + \boldsymbol{R}(\boldsymbol{Y},\widehat{\boldsymbol{\beta}}^{(t)})\right).
$$

*Proof* A straightforward calculation shows

$$\begin{aligned} \widehat{\boldsymbol{\beta}}^{(t+1)} &= \widehat{\boldsymbol{\beta}}^{(t)} + \widetilde{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}\, \widetilde{s}(\widehat{\boldsymbol{\beta}}^{(t)},\boldsymbol{Y}) \\ &= \widetilde{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}\left(\widetilde{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{(t)})\,\widehat{\boldsymbol{\beta}}^{(t)} + \mathfrak{X}^{\top}W(\widehat{\boldsymbol{\beta}}^{(t)})\,\boldsymbol{R}(\boldsymbol{Y},\widehat{\boldsymbol{\beta}}^{(t)}) - 2n\lambda\,\widehat{\boldsymbol{\beta}}^{(t)}_{-}\right) \\ &= \widetilde{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}\left(\mathcal{I}(\widehat{\boldsymbol{\beta}}^{(t)})\,\widehat{\boldsymbol{\beta}}^{(t)} + \mathfrak{X}^{\top}W(\widehat{\boldsymbol{\beta}}^{(t)})\,\boldsymbol{R}(\boldsymbol{Y},\widehat{\boldsymbol{\beta}}^{(t)})\right) \\ &= \widetilde{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}\,\mathfrak{X}^{\top}W(\widehat{\boldsymbol{\beta}}^{(t)})\left(\mathfrak{X}\widehat{\boldsymbol{\beta}}^{(t)} + \boldsymbol{R}(\boldsymbol{Y},\widehat{\boldsymbol{\beta}}^{(t)})\right). \end{aligned}$$

This proves the claim.

Lemma 6.4 allows us to fit a ridge regularized GLM. To determine an optimal regularization parameter *λ* ≥ 0 one uses cross-validation; in particular, generalized cross-validation yields an efficient cross-validation method, see (5.67).
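To illustrate Lemma 6.4, consider the special case of a Gaussian model with identity link, where the working weights reduce to *W(β)* = 1 and the working residuals to *R(Y, β)* = *Y* − 𝔛*β*; the update then converges in a single step to the closed-form ridge estimator. The following Python/NumPy sketch implements one scoring step under these simplifying assumptions (our own illustration, not the book's R routine):

```python
import numpy as np

def ridge_scoring_step(beta, X, Y, lam):
    """One Fisher's scoring step of Lemma 6.4 in the Gaussian case
    with identity link: W = identity and R(Y, beta) = Y - X beta."""
    n, q1 = X.shape
    D = np.diag([0.0] + [1.0] * (q1 - 1))   # intercept is not regularized
    I_tilde = X.T @ X + 2.0 * n * lam * D   # regularized information matrix
    return np.linalg.solve(I_tilde, X.T @ (X @ beta + (Y - X @ beta)))

# toy design with intercept column and three standardized-ish features
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
Y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + 0.1 * rng.normal(size=n)

beta1 = ridge_scoring_step(np.zeros(4), X, Y, lam=0.1)
beta2 = ridge_scoring_step(beta1, X, Y, lam=0.1)   # already converged
```

In the general GLM case *W* and *R* depend on the current *β*, and several scoring steps are needed; the Gaussian case just makes the fixed point visible immediately.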

*Example 6.5 (Ridge Regression)* We revisit the gamma claim size example of Sect. 5.3.7, and we choose model Gamma GLM1, see Listing 5.11. This example does not consider any categorical features, but only continuous ones. We directly

**Fig. 6.3** Ridge regularized MLEs in model Gamma GLM1: (lhs) in-sample deviance losses as a function of the regularization parameter *λ* > 0, (rhs) resulting $\widehat{\beta}_j^{\text{ridge}}(\lambda)$ for 1 ≤ *j* ≤ *q* = 8

apply Fisher's scoring updates (6.10).<sup>3</sup> For this analysis we center and normalize (to unit variance) the columns of the design matrix (except for the initial column of $\mathfrak{X}$ encoding the intercept).

Figure 6.3 (lhs) shows the resulting in-sample deviance losses as a function of *λ* > 0. The regularization parameter *λ* allows us to continuously connect the in-sample deviance losses of the null model (2.085) and of model Gamma GLM1 (1.717), see Table 5.13. Figure 6.3 (rhs) shows the regression parameter estimates $\widehat{\beta}_j^{\text{ridge}}(\lambda)$, 1 ≤ *j* ≤ *q* = 8, as a function of *λ* > 0. Overall they decrease because the budget constraint gets tighter for increasing *λ*; however, the individual parameters need not be monotone, since one parameter may (better) compensate the decrease of another (through correlations in the feature components).

Finally, we need to choose the optimal regularization parameter *λ >* 0. This is done by cross-validation. We exploit the generalized cross-validation loss, see (5.67), and the hat matrix in this ridge regularized case is given by

$$H_{\lambda} = W(\widehat{\boldsymbol{\beta}}^{\text{ridge}})^{1/2}\,\mathfrak{X}\;\widetilde{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{\text{ridge}})^{-1}\,\mathfrak{X}^{\top}W(\widehat{\boldsymbol{\beta}}^{\text{ridge}})^{1/2}.$$

In contrast to (5.66), this hat matrix *Hλ* is not a projection, but we would need to work in an augmented model to receive the projection property (accounting for the regularization part).

Figure 6.4 plots the generalized cross-validation loss as a function of *λ* > 0. We observe the minimum in parameter *λ* = *e*−9.4. The resulting generalized cross-validation loss is 1.76742. This is bigger than the one received in model Gamma

<sup>3</sup> The R command glmnet [142] allows for regularized MLE; however, the current version does not include the gamma distribution. Therefore, we have implemented our own routine.

GLM2, see Table 5.16; thus, we still prefer model Gamma GLM2 over the optimally ridge regularized model Gamma GLM1. Note that for model Gamma GLM2 we did variable selection, whereas ridge regression just generally shrinks the regression parameters. For more interpretation we refer to Example 6.8, below, which considers LASSO regularization.

## *6.2.4 LASSO Regularization*

In this section we consider LASSO regularization (*p* = 1). This is more challenging than ridge regularization because of the non-differentiability of the budget constraint, see Fig. 6.1 (rhs). This section follows Chapters 2 and 5 of Hastie et al. [184] and Parikh–Boyd [292].

#### **Gaussian Case**

We start with the homoskedastic Gaussian model having unit variance *σ*2 = 1. In a first step, the regression model only involves one feature component, *q* = 1. Thus, we aim at solving the LASSO optimization

$$\widehat{\boldsymbol{\beta}}^{\text{LASSO}} = \underset{\boldsymbol{\beta}\in\mathbb{R}^2}{\arg\max}\ -\frac{1}{2n}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 x_i\right)^2 - \lambda|\beta_1|.$$

We standardize the observations and features (*Yi, xi*)1≤*i*≤*n* such that $\sum_{i=1}^n Y_i = 0$, $\sum_{i=1}^n x_i = 0$ and $n^{-1}\sum_{i=1}^n x_i^2 = 1$. This implies that we can omit the intercept parameter *β*0, as the optimal intercept satisfies for this standardized data (and any *β*1 ∈ R)

$$
\widehat{\beta}_0 = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \beta_1 x_i\right) = 0. \tag{6.11}
$$

Thus, w.l.o.g., we assume to work with standardized data in this section. This gives us the optimization problem (we drop the lower index of *β*1 because we only have one component)

$$\widehat{\beta}^{\text{LASSO}} = \widehat{\beta}^{\text{LASSO}}(\lambda) = \underset{\beta\in\mathbb{R}}{\arg\max}\ -\frac{1}{2n}\sum_{i=1}^{n}\left(Y_i - \beta x_i\right)^2 - \lambda|\beta|. \tag{6.12}$$

The difficulty is that the regularization term is not differentiable in zero. Since this term is convex, we can express its derivative in terms of a sub-gradient $\mathfrak{s}$. This provides the score

$$\frac{\partial}{\partial\beta}\left(-\frac{1}{2n}\sum_{i=1}^{n}\left(Y_i - \beta x_i\right)^2 - \lambda|\beta|\right) = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \beta x_i\right)x_i - \lambda\mathfrak{s} = \frac{1}{n}\langle\boldsymbol{Y},\boldsymbol{x}\rangle - \beta - \lambda\mathfrak{s},$$

where we use the standardization $n^{-1}\sum_{i=1}^n x_i^2 = 1$ in the second step, $\langle\boldsymbol{Y},\boldsymbol{x}\rangle$ is the scalar product of $\boldsymbol{Y}$ and $\boldsymbol{x} = (x_1,\ldots,x_n)^\top \in \mathbb{R}^n$, and where we consider the sub-gradient

$$\mathfrak{s} = \mathfrak{s}(\beta) = \begin{cases} +1 & \text{if } \beta > 0, \\ -1 & \text{if } \beta < 0, \\ \in [-1, 1] & \text{otherwise.} \end{cases}$$

Henceforth, we receive the score equation for *β* ≠ 0

$$n^{-1}\langle\boldsymbol{Y},\boldsymbol{x}\rangle - \beta - \lambda\mathfrak{s} = n^{-1}\langle\boldsymbol{Y},\boldsymbol{x}\rangle - \beta - \mathrm{sign}(\beta)\,\lambda \stackrel{!}{=} 0.$$

This score equation has a proper solution *β* > 0 if $n^{-1}\langle\boldsymbol{Y},\boldsymbol{x}\rangle > \lambda$, and it has a proper solution *β* < 0 if $n^{-1}\langle\boldsymbol{Y},\boldsymbol{x}\rangle < -\lambda$. In any other case we have the boundary solution $\widehat{\beta} = 0$ for our maximization problem (6.12).

This solution can be written in terms of the following *soft-thresholding operator* for *λ* ≥ 0

$$\widehat{\beta}^{\text{LASSO}} = \mathcal{S}_{\lambda}\left(n^{-1}\langle\boldsymbol{Y},\boldsymbol{x}\rangle\right) \qquad \text{with}\quad \mathcal{S}_{\lambda}(x) = \mathrm{sign}(x)\,(|x| - \lambda)_{+}. \tag{6.13}$$

This soft-thresholding operator is illustrated in Fig. 6.5 for *λ* = 4.
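The soft-thresholding operator (6.13) is straightforward to implement; the following Python sketch (our own illustration, not the book's R code) also reproduces the resulting univariate LASSO estimator for standardized data.

```python
def soft_threshold(x, lam):
    """Soft-thresholding operator S_lam(x) = sign(x) * (|x| - lam)_+."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def lasso_univariate(Y, x, lam):
    """Univariate LASSO estimator (6.13) for standardized data:
    S_lam applied to the empirical moment n^{-1} <Y, x>."""
    n = len(Y)
    return soft_threshold(sum(yi * xi for yi, xi in zip(Y, x)) / n, lam)

# with lam = 4 (as in Fig. 6.5): values in [-4, 4] are set exactly to zero
print(soft_threshold(6.0, 4.0))    # -> 2.0
print(soft_threshold(-6.0, 4.0))   # -> -2.0
print(soft_threshold(3.0, 4.0))    # -> 0.0
```

The flat region around zero is precisely what produces exact sparsity in LASSO regularization.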

This approach can be generalized to multiple feature components $\boldsymbol{x} \in \mathbb{R}^q$. We standardize the observations and features such that $\sum_{i=1}^n Y_i = 0$, $\sum_{i=1}^n x_{i,j} = 0$ and $n^{-1}\sum_{i=1}^n x_{i,j}^2 = 1$ for all 1 ≤ *j* ≤ *q*. This allows us again to drop the intercept term and to directly consider

$$\widehat{\boldsymbol{\beta}}^{\text{LASSO}} = \widehat{\boldsymbol{\beta}}^{\text{LASSO}}(\lambda) = \underset{\boldsymbol{\beta}\in\mathbb{R}^{q}}{\arg\max}\ -\frac{1}{2n}\sum_{i=1}^{n}\left(Y_i - \sum_{j=1}^{q}\beta_j x_{i,j}\right)^2 - \lambda\|\boldsymbol{\beta}\|_1.$$

Since this is a concave (quadratic) maximization problem with a separable (convex) penalty term, we can apply a *cyclic coordinate descent method* that iterates coordinate-wise maximization until convergence. Thus, if we want to maximize in the *t*-th iteration over the *j*-th coordinate of the regression parameter, we consider recursively

$$\widehat{\beta}_j^{(t)} = \underset{\beta_j\in\mathbb{R}}{\arg\max}\ -\frac{1}{2n}\sum_{i=1}^{n}\left(Y_i - \sum_{l=1}^{j-1}\widehat{\beta}_l^{(t)} x_{i,l} - \sum_{l=j+1}^{q}\widehat{\beta}_l^{(t-1)} x_{i,l} - \beta_j x_{i,j}\right)^2 - \lambda|\beta_j|.$$

Using the soft-thresholding operator (6.13) we find the optimal solution

$$
\widehat{\beta}_j^{(t)} = \mathcal{S}_{\lambda}\left(n^{-1}\left\langle \boldsymbol{Y} - \sum_{l=1}^{j-1}\widehat{\beta}_l^{(t)}\boldsymbol{x}_l - \sum_{l=j+1}^{q}\widehat{\beta}_l^{(t-1)}\boldsymbol{x}_l,\ \boldsymbol{x}_j\right\rangle\right),
$$

with vectors $\boldsymbol{x}_l = (x_{1,l},\ldots,x_{n,l})^\top \in \mathbb{R}^n$ for 1 ≤ *l* ≤ *q*. Iteration until convergence provides the LASSO regularized estimator $\widehat{\boldsymbol{\beta}}^{\text{LASSO}}(\lambda)$ for a given regularization parameter *λ* > 0.
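The cyclic coordinate descent iteration just described can be sketched compactly in Python/NumPy for standardized Gaussian data (our own illustrative implementation, not the book's R code). At convergence the estimate satisfies the sub-gradient (KKT) conditions, i.e., $n^{-1}\langle \boldsymbol{Y} - \mathfrak{X}\widehat{\boldsymbol{\beta}}, \boldsymbol{x}_j\rangle = \lambda\,\mathrm{sign}(\widehat{\beta}_j)$ for non-zero coordinates, which gives a convenient numerical check.

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * max(abs(x) - lam, 0.0)

def lasso_cd(X, Y, lam, n_sweeps=200):
    """Cyclic coordinate descent for the standardized Gaussian LASSO:
    each coordinate update is a univariate soft-thresholding step
    applied to the partial residuals."""
    n, q = X.shape
    beta = np.zeros(q)
    for _ in range(n_sweeps):
        for j in range(q):
            r = Y - X @ beta + X[:, j] * beta[j]    # partial residuals
            beta[j] = soft_threshold(X[:, j] @ r / n, lam)
    return beta

# standardized toy data: mean-zero columns with n^{-1} sum x^2 = 1
rng = np.random.default_rng(1)
n, q = 200, 5
X = rng.normal(size=(n, q))
X = X - X.mean(axis=0)
X = X / np.sqrt((X ** 2).mean(axis=0))
Y = X[:, 0] - 0.5 * X[:, 1] + 0.2 * rng.normal(size=n)
Y = Y - Y.mean()

beta = lasso_cd(X, Y, lam=0.1)
```

A pathwise run over a decreasing sequence of *λ*'s can reuse `lasso_cd` with the previous solution as a warm start, which is the idea behind the pathwise method discussed next in the text.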

Typically, we want to explore $\widehat{\boldsymbol{\beta}}^{\text{LASSO}}(\lambda)$ for multiple *λ*'s. For this, one runs a *pathwise cyclic coordinate descent method*. We start with a large value for *λ*, namely, we define

$$\lambda^{\max} = \max\_{1 \le j \le q} n^{-1} \left| \langle Y, x\_j \rangle \right|.$$

For $\lambda \ge \lambda^{\max}$, we have $\widehat{\boldsymbol{\beta}}^{\text{LASSO}}(\lambda) = \boldsymbol{0}$, i.e., we have the null model. Pathwise cyclic coordinate descent starts with this solution for $\lambda_0 = \lambda^{\max}$. In a next step, one slightly decreases *λ*0 and runs the cyclic coordinate descent algorithm until convergence for this slightly smaller $\lambda_1 < \lambda_0$, with starting value $\widehat{\boldsymbol{\beta}}^{\text{LASSO}}(\lambda_0)$. This is then iterated for $\lambda_{t+1} < \lambda_t$, *t* ≥ 0, which provides a sequence of LASSO regularized estimators $\widehat{\boldsymbol{\beta}}^{\text{LASSO}}(\lambda_t)$ along the path $(\lambda_t)_{t\ge 0}$.

For further remarks we refer to Section 2.6 in Hastie et al. [184]. This concerns statements about uniqueness for general design matrices, also in the set-up where *q* > *n*, i.e., where we have more parameters than observations. Moreover, references to convergence results are given in Section 2.7 of Hastie et al. [184]. This closes the Gaussian case.

#### **Gradient Descent Algorithm for LASSO Regularization**

In Sect. 7.2.3 we will discuss gradient descent methods for network fitting. In this section we provide preliminary considerations on gradient descent methods because these are also useful to fit LASSO regularized parameters within GLMs (beyond the Gaussian case). Note that we perform a sign switch in what follows: we now aim at minimizing an objective function *g*.

Choose a convex and differentiable function $g : \mathbb{R}^{q+1} \to \mathbb{R}$. Assuming that the global minimum of *g* is achieved, a necessary and sufficient condition for the optimality of $\boldsymbol{\beta}^* \in \mathbb{R}^{q+1}$ in this convex setting is $\nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta})|_{\boldsymbol{\beta}=\boldsymbol{\beta}^*} = \boldsymbol{0}$. *Gradient descent algorithms* find this optimal point by iterating for *t* ≥ 0

$$\boldsymbol{\beta}^{(t)} \mapsto \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \varrho_{t+1}\nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta}^{(t)}), \tag{6.14}$$

for tempered *learning rates* $\varrho_{t+1} > 0$. This algorithm is motivated by a first order Taylor expansion that determines the direction of the maximal local decrease of the objective function *g*, given we are in position *β*, i.e.,

$$g(\widetilde{\boldsymbol{\beta}}) = g(\boldsymbol{\beta}) + \nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta})^{\top}\left(\widetilde{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) + o\left(\|\widetilde{\boldsymbol{\beta}} - \boldsymbol{\beta}\|_2\right) \qquad \text{as}\quad \|\widetilde{\boldsymbol{\beta}} - \boldsymbol{\beta}\|_2 \to 0.$$

The gradient descent algorithm (6.14) leads to the (unconstrained) minimum of the objective function *g* at convergence. A budget constraint like (6.6) leads to a convex constraint $\boldsymbol{\beta} \in C \subset \mathbb{R}^{q+1}$. Consideration of such a convex constraint requires that we reformulate the gradient descent algorithm (6.14). The gradient descent step (6.14) can also be found, for a given learning rate $\varrho_{t+1}$, by solving the following linearized problem for *g* with a Euclidean square distance penalty term (ridge regularization) penalizing too big gradient descent steps

$$\underset{\boldsymbol{\beta}\in\mathbb{R}^{q+1}}{\arg\min}\left\{g(\boldsymbol{\beta}^{(t)}) + \nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta}^{(t)})^{\top}\left(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\right) + \frac{1}{2\varrho_{t+1}}\|\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\|_2^2\right\}. \tag{6.15}$$

The solution to this optimization problem exactly gives the gradient descent step (6.14). This is now adapted to a constrained gradient descent update for the convex constraint *C*:

$$\boldsymbol{\beta}^{(t+1)} = \underset{\boldsymbol{\beta}\in C}{\arg\min}\left\{g(\boldsymbol{\beta}^{(t)}) + \nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta}^{(t)})^{\top}\left(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\right) + \frac{1}{2\varrho_{t+1}}\|\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\|_2^2\right\}. \tag{6.16}$$

The solution to this constrained convex optimization problem is obtained by, first, taking an unconstrained gradient descent step $\boldsymbol{\beta}^{(t)} \mapsto \boldsymbol{\beta}^{(t)} - \varrho_{t+1}\nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta}^{(t)})$, and, second, if this step is not within the convex set *C*, projecting it back to *C*; this is illustrated in Fig. 6.6, and it is called a *projected gradient descent step* (the justification is given in Lemma 6.6 below). Thus, the only difficulty in applying this projected gradient descent step is to find an efficient method of projecting the unconstrained solution (6.14)–(6.15) back to the convex constraint set *C*.

Assume that the convex constraint set *C* is expressed by a convex function *h* (not necessarily differentiable). To solve (6.16) and to motivate the projected gradient descent step, we use the *proximal gradient method* discussed in Section 5.3.3 of Hastie et al. [184]. The proximal gradient method helps us to do the projection in the projected gradient descent step. We introduce the *generalized projection operator*, for $\boldsymbol{z} \in \mathbb{R}^{q+1}$

$$\operatorname{prox}_h(\boldsymbol{z}) = \underset{\boldsymbol{\beta}\in\mathbb{R}^{q+1}}{\arg\min}\left\{\frac{1}{2}\|\boldsymbol{z} - \boldsymbol{\beta}\|_2^2 + h(\boldsymbol{\beta})\right\}. \tag{6.17}$$

This generalized projection operator should be interpreted as the square minimization problem $\|\boldsymbol{z} - \boldsymbol{\beta}\|_2^2/2$ on a convex set *C*, the latter being expressed in its dual Lagrangian formulation by the regularization term *h(β)*. The following lemma shows that the generalized projection operator solves the Lagrangian form of (6.16).

**Lemma 6.6** *Assume the convex constraint C is expressed by the convex function h. The generalized projection operator solves*

$$\boldsymbol{\beta}^{(t+1)} = \operatorname{prox}_{\varrho_{t+1}h}\left(\boldsymbol{\beta}^{(t)} - \varrho_{t+1}\nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta}^{(t)})\right) \tag{6.18}$$

$$= \underset{\boldsymbol{\beta}\in\mathbb{R}^{q+1}}{\arg\min}\left\{g(\boldsymbol{\beta}^{(t)}) + \nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta}^{(t)})^{\top}\left(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\right) + \frac{1}{2\varrho_{t+1}}\|\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\|_2^2 + h(\boldsymbol{\beta})\right\}.$$

*Proof of Lemma 6.6* It suffices to consider the following calculation

$$\begin{aligned} &\frac{1}{2}\left\|\boldsymbol{\beta}^{(t)} - \varrho_{t+1}\nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta}^{(t)}) - \boldsymbol{\beta}\right\|_2^2 + \varrho_{t+1}h(\boldsymbol{\beta}) \\ &= \frac{1}{2}\varrho_{t+1}^2\left\|\nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta}^{(t)})\right\|_2^2 - \varrho_{t+1}\left\langle\nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta}^{(t)}),\ \boldsymbol{\beta}^{(t)} - \boldsymbol{\beta}\right\rangle + \frac{1}{2}\left\|\boldsymbol{\beta}^{(t)} - \boldsymbol{\beta}\right\|_2^2 + \varrho_{t+1}h(\boldsymbol{\beta}) \\ &= \frac{1}{2}\varrho_{t+1}^2\left\|\nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta}^{(t)})\right\|_2^2 + \varrho_{t+1}\left(\nabla_{\boldsymbol{\beta}}\, g(\boldsymbol{\beta}^{(t)})^{\top}\left(\boldsymbol{\beta} - \boldsymbol{\beta}^{(t)}\right) + \frac{1}{2\varrho_{t+1}}\left\|\boldsymbol{\beta}^{(t)} - \boldsymbol{\beta}\right\|_2^2 + h(\boldsymbol{\beta})\right). \end{aligned}$$

This is exactly the right objective function (in the round brackets) if we ignore all terms that are independent of *β*. This proves the lemma.

Thus, to solve the constrained optimization problem (6.16) we bring it into its dual Lagrangian form (6.18). Then we apply the generalized projection operator to the unconstrained solution to find the constrained solution, see Lemma 6.6. This approach is successful if we can explicitly compute the generalized projection operator prox*h*(·).

**Lemma 6.7** *The generalized projection operator* (6.17) *satisfies for the LASSO constraint* $h(\boldsymbol{\beta}) = \lambda \left\| \boldsymbol{\beta}_- \right\|_1$

$$\operatorname{prox}_h(\boldsymbol{z}) = \mathcal{S}_\lambda^{\text{LASSO}}(\boldsymbol{z}) \stackrel{\text{def.}}{=} \left( z_0, \operatorname{sign}(z_1)(|z_1| - \lambda)_+, \dots, \operatorname{sign}(z_q)(|z_q| - \lambda)_+ \right)^\top, \qquad \boldsymbol{z} \in \mathbb{R}^{q+1}.$$

*Proof of Lemma 6.7* We need to solve for the function $\boldsymbol{\beta} \mapsto h(\boldsymbol{\beta}) = \lambda \left\| \boldsymbol{\beta}_- \right\|_1$

$$\operatorname{prox}_{\lambda \|(\cdot)_-\|_1}(\boldsymbol{z}) = \operatorname*{arg\,min}_{\boldsymbol{\beta} \in \mathbb{R}^{q+1}} \left\{ \frac{1}{2} \left\| \boldsymbol{z} - \boldsymbol{\beta} \right\|_2^2 + \lambda \left\| \boldsymbol{\beta}_- \right\|_1 \right\} = \operatorname*{arg\,min}_{\boldsymbol{\beta} \in \mathbb{R}^{q+1}} \left\{ \frac{1}{2} \sum_{j=0}^q (z_j - \beta_j)^2 + \lambda \sum_{j=1}^q |\beta_j| \right\}.$$

This decouples into $q+1$ independent optimization problems. The first one is solved by $\beta_0 = z_0$, and the remaining ones are solved by the soft-thresholding operator (6.13). This finishes the proof.
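The operator of Lemma 6.7 is straightforward to implement. The following minimal numpy sketch (function name ours) leaves the intercept component untouched and soft-thresholds all remaining components:

```python
import numpy as np

def soft_threshold_lasso(z, lam):
    """Generalized projection operator prox_h of Lemma 6.7.

    Component z[0] (the intercept) is not penalized; every other
    component is shrunk towards zero by lam and clipped at zero.
    """
    z = np.asarray(z, dtype=float)
    out = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    out[0] = z[0]  # intercept beta_0 stays unpenalized
    return out
```

For $\lambda = 1$, a component $1.5$ is shrunk to $0.5$, a component $-0.3$ is set exactly to zero, and the intercept is returned unchanged.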

We conclude that the constrained optimization problem (6.16) for the (convex) LASSO constraint $\mathcal{C} = \{\boldsymbol{\beta};\, \|\boldsymbol{\beta}_-\|_1 \le c\}$ is brought into its dual Lagrangian form (6.18) of Lemma 6.6 with $h(\boldsymbol{\beta}) = \lambda \|\boldsymbol{\beta}_-\|_1$ for a suitable $\lambda = \lambda(c)$. The LASSO regularized parameter estimation is then solved by first performing an unconstrained gradient descent step $\boldsymbol{\beta}^{(t)} \mapsto \boldsymbol{\beta}^{(t)} - \varrho_{t+1} \nabla_{\boldsymbol{\beta}} g(\boldsymbol{\beta}^{(t)})$, and this updated parameter is projected back towards $\mathcal{C}$ using the generalized projection operator of Lemma 6.7 with $h(\boldsymbol{\beta}) = \varrho_{t+1} \lambda \|\boldsymbol{\beta}_-\|_1$.

Proximal gradient descent algorithm for LASSO

1. Make the gradient descent step for a suitable learning rate $\varrho_{t+1} > 0$

$$
\boldsymbol{\beta}^{(t)} \mapsto \widetilde{\boldsymbol{\beta}}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \varrho_{t+1} \nabla_{\boldsymbol{\beta}} g(\boldsymbol{\beta}^{(t)}).
$$

2. Perform soft-thresholding of the gradient descent solution

$$
\widetilde{\boldsymbol{\beta}}^{(t+1)} \mapsto \boldsymbol{\beta}^{(t+1)} = \mathcal{S}^{\text{LASSO}}_{\varrho_{t+1} \lambda} \left( \widetilde{\boldsymbol{\beta}}^{(t+1)} \right),
$$

where the latter soft-thresholding function is defined in Lemma 6.7.

3. Iterate these two steps until a stopping criterion is met.

If the gradient $\nabla_{\boldsymbol{\beta}} g(\cdot)$ is Lipschitz continuous with Lipschitz constant $L > 0$, the proximal gradient descent algorithm converges at rate $O(1/t)$ for a fixed step size $0 < \varrho = \varrho_{t+1} \le 1/L$, see Section 4.2 in Parikh–Boyd [292].
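The two-step iteration above can be sketched as follows. This is a minimal illustration (not the implementation used for the book's examples): it substitutes a Gaussian least-squares negative log-likelihood for $g$, uses the fixed step size $\varrho = 1/L$ with $L = \|X\|_2^2$, and the toy data and all names are ours.

```python
import numpy as np

def prox_grad_lasso(X, y, lam, n_iter=2000):
    """Proximal gradient descent for LASSO with g(beta) = ||y - X beta||^2 / 2.

    Step 1: plain gradient descent step with fixed learning rate 1/L;
    step 2: soft-thresholding with threshold rho * lam, leaving the
    intercept (first column of X) unpenalized.
    """
    rho = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L, L = Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = beta - rho * (X.T @ (X @ beta - y))            # gradient step
        beta = np.sign(b) * np.maximum(np.abs(b) - rho * lam, 0.0)
        beta[0] = b[0]                                     # intercept unpenalized
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.5]) + 0.1 * rng.normal(size=200)
beta_hat = prox_grad_lasso(X, y, lam=5.0)  # the true-zero coefficient is typically shrunk exactly to zero
```

This reproduces the qualitative behavior discussed below: for a sufficiently large $\lambda$, unimportant components are set exactly to zero rather than merely shrunk.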

*Example 6.8 (LASSO Regression)* We revisit Example 6.5 which considers claim size modeling using model Gamma GLM1. In order to apply the proximal gradient descent algorithm for LASSO regularization we need to calculate the gradient of the negative log-likelihood. In the gamma case with log-link, it is given by, see Example 5.5,

$$\begin{aligned} -\nabla_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) &= -\mathfrak{X}^\top W(\boldsymbol{\beta}) \boldsymbol{R}(\boldsymbol{Y}, \boldsymbol{\beta}) \\ &= -\mathfrak{X}^\top \operatorname{diag}\left( \frac{n_1}{\varphi}, \dots, \frac{n_m}{\varphi} \right) \left( \frac{Y_1}{\mu_1} - 1, \dots, \frac{Y_m}{\mu_m} - 1 \right)^\top, \end{aligned}$$

**Fig. 6.7** LASSO regularized MLEs in model Gamma GLM1: (lhs) in-sample losses as a function of the regularization parameter $\lambda > 0$, (rhs) resulting $\widehat{\beta}_j^{\text{LASSO}}(\lambda)$ for $1 \le j \le q$

where $m \in \mathbb{N}$ is the number of policies with claims, and $\mu_i = \mu_i(\boldsymbol{\beta}) = \exp\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$. We set $\varphi = 1$ as this constant can be integrated into the learning rates $\varrho_{t+1}$.

We have implemented the proximal gradient descent algorithm ourselves, using an equidistant grid for the regularization parameter $\lambda > 0$, a fixed learning rate $\varrho_{t+1} = 0.05$ and normalized features. Since this has been done in a rather brute-force way, the results presented in Fig. 6.7 look a bit wiggly. These results should be compared to Fig. 6.3. We see that, in contrast to ridge regularization, less important regression parameters are shrunk exactly to zero in LASSO regularization. The order in which the parameters are shrunk to zero is: $\beta_1$ (OwnerAge), $\beta_4$ (RiskClass), $\beta_6$ (VehAge$^2$), $\beta_8$ (BonusClass), $\beta_7$ (GenderMale), $\beta_2$ (OwnerAge$^2$), $\beta_3$ (AreaGLM) and $\beta_5$ (VehAge). In view of Listing 5.11 this order seems a bit surprising. The reason is that we have grouped features here, and, obviously, these should be considered jointly. In particular, we first drop OwnerAge because it can partially be explained by OwnerAge$^2$; therefore, we should not treat these two variables individually. Having this weakness in mind supports the conclusions drawn from the Wald tests in Listing 5.11, and we come back to this in Example 6.10, below.


#### **Oracle Property**

An interesting question is whether the chosen regularization fulfills the so-called oracle property. For simplicity, we assume to work in the normalized Gaussian case that allows us to exclude the intercept $\beta_0$, see (6.11). Thus, we work with a regression parameter $\boldsymbol{\beta} \in \mathbb{R}^q$. Assume that there is a true data model that can be described by the (true) regression parameter $\boldsymbol{\beta}^* \in \mathbb{R}^q$. Denote by $\mathcal{A}^* = \{j \in \{1,\dots,q\};\, \beta_j^* \neq 0\}$ the set of feature components of $\boldsymbol{x} \in \mathbb{R}^q$ that determine the true regression function, and assume $|\mathcal{A}^*| < q$. Denote by $\widehat{\boldsymbol{\beta}}_n(\lambda)$ the parameter estimate received from the regularized MAP estimation for a given regularization parameter $\lambda \ge 0$, based on i.i.d. data of sample size $n$. We say that $(\widehat{\boldsymbol{\beta}}_n(\lambda_n))_{n \in \mathbb{N}}$ fulfills the *oracle property* if there exists a sequence $(\lambda_n)_{n \in \mathbb{N}}$ of regularization parameters $\lambda_n \ge 0$ such that

$$\lim\_{n \to \infty} \mathbb{P}[\widehat{\mathcal{A}}\_{n} = \mathcal{A}^{\*}] = 1,\tag{6.19}$$

$$\sqrt{n}\left( \widehat{\boldsymbol{\beta}}_{n,\mathcal{A}^*}(\lambda_n) - \boldsymbol{\beta}^*_{\mathcal{A}^*} \right) \Rightarrow \mathcal{N}\left(0, \mathcal{I}_{\mathcal{A}^*}^{-1}\right) \qquad \text{as } n \to \infty,\tag{6.20}$$

where $\widehat{\mathcal{A}}_n = \{j \in \{1,\dots,q\};\, (\widehat{\boldsymbol{\beta}}_n(\lambda_n))_j \neq 0\}$, $\boldsymbol{\beta}_{\mathcal{A}}$ only considers the components in $\mathcal{A} \subset \{1,\dots,q\}$, and $\mathcal{I}_{\mathcal{A}^*}$ is Fisher's information matrix on the true feature components. The first oracle property (6.19) tells us that asymptotically we choose the right feature components, and the second oracle property (6.20) tells us that we have asymptotic normality and, in particular, consistency on the right feature components.

Zou [408] states that LASSO regularization, in general, does not satisfy the oracle property. LASSO regularization can perform variable selection, however, as Zou [408] argues, there are situations where consistency is violated and, therefore, the oracle property cannot hold in general. Zou [408] therefore proposes an adaptive LASSO regularization method. Alternatively, Fan–Li [124] introduced smoothly clipped absolute deviation (SCAD) regularization which is a non-convex regularization that possesses the oracle property. SCAD regularization of *β* is obtained by penalizing

$$J_\lambda(\boldsymbol{\beta}) = \sum_{j=1}^q \lambda |\beta_j|\, \mathbb{1}_{\{|\beta_j| \le \lambda\}} - \frac{|\beta_j|^2 - 2a\lambda|\beta_j| + \lambda^2}{2(a-1)}\, \mathbb{1}_{\{\lambda < |\beta_j| \le a\lambda\}} + \frac{(a+1)\lambda^2}{2}\, \mathbb{1}_{\{|\beta_j| > a\lambda\}},$$

for a hyperparameter $a > 2$. This function is continuous, and differentiable except in $\beta_j = 0$, with partial derivative for $\beta > 0$ given by

$$
\lambda \left( \mathbb{1}_{\{\beta \le \lambda\}} + \frac{(a\lambda - \beta)_+}{\lambda(a-1)}\, \mathbb{1}_{\{\beta > \lambda\}} \right).
$$

**Fig. 6.8** (lhs) LASSO soft-thresholding operator $x \mapsto \mathcal{S}_\lambda(x)$ for $\lambda = 4$ (red dotted lines), (rhs) SCAD thresholding operator $x \mapsto \mathcal{S}_\lambda^{\text{SCAD}}(x)$ for $\lambda = 4$ and $a = 3$

Thus, we have a constant LASSO-like slope $\lambda > 0$ for $0 < \beta \le \lambda$, shrinking some components exactly to zero. For $\beta > a\lambda$ the slope is 0, removing regularization, and the two regimes are connected continuously in between. The thresholding operator for SCAD regularization is given by, see Fan–Li [124],

$$\mathcal{S}^{\text{SCAD}}_\lambda(x) = \begin{cases} \operatorname{sign}(x)(|x| - \lambda)_+ & \text{for } |x| \le 2\lambda, \\ \dfrac{(a-1)x - \operatorname{sign}(x)\, a\lambda}{a-2} & \text{for } 2\lambda < |x| \le a\lambda, \\ x & \text{for } |x| > a\lambda. \end{cases}$$

Figure 6.8 compares the two thresholding operators of LASSO and SCAD.
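The SCAD operator can also be sketched in a few lines of numpy (function name ours), mirroring the three cases of the displayed formula:

```python
import numpy as np

def scad_threshold(x, lam, a=3.0):
    """SCAD thresholding operator (Fan-Li), hyperparameter a > 2.

    |x| <= 2*lam:          LASSO-like soft-thresholding;
    2*lam < |x| <= a*lam:  milder shrinkage;
    |x| > a*lam:           returned unchanged (no regularization).
    """
    x = np.asarray(x, dtype=float)
    soft = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
    mid = ((a - 1.0) * x - np.sign(x) * a * lam) / (a - 2.0)
    return np.where(np.abs(x) <= 2.0 * lam,
                    soft,
                    np.where(np.abs(x) <= a * lam, mid, x))
```

For $\lambda = 4$ and $a = 3$ as in Fig. 6.8, $x = 6$ is shrunk to $2$, $x = 10$ to $8$, and $x = 13$ is left unchanged.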

Alternatively, we propose to do variable selection with LASSO regularization in a first step. Since the resulting LASSO regularized estimator may not be consistent, one should explore a second regression step that fits an un-penalized regression model on the LASSO selected components; we also refer to Lee et al. [237].

## *6.2.5 Group LASSO Regularization*

In Example 6.8 we have seen that if there are natural groups within the feature components, they should be treated simultaneously. Assume we have a group structure $\boldsymbol{x} = (x_0, \boldsymbol{x}_1, \dots, \boldsymbol{x}_K)$ with groups $\boldsymbol{x}_k \in \mathbb{R}^{q_k}$ that should be treated simultaneously. This motivates the grouped penalties proposed by Yuan–Lin [398], see (6.5),

$$\widehat{\boldsymbol{\beta}}^{\text{group}} = \widehat{\boldsymbol{\beta}}^{\text{group}}(\lambda) = \operatorname*{arg\,max}_{\boldsymbol{\beta} = (\beta_0, \boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_K)} \frac{1}{n} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) - \lambda \sum_{k=1}^K \|\boldsymbol{\beta}_k\|_2,\qquad(6.21)$$

where we assume a group structure in the linear predictor providing

$$\boldsymbol{x} \mapsto \eta(\boldsymbol{x}) = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = \beta_0 + \sum_{k=1}^K \langle \boldsymbol{\beta}_k, \boldsymbol{x}_k \rangle.$$

LASSO regularization is a special case of this grouped regularization: if all groups $1 \le k \le K$ contain one single component only, i.e., $K = q$, we have $\widehat{\boldsymbol{\beta}}^{\text{group}} = \widehat{\boldsymbol{\beta}}^{\text{LASSO}}$.

The side constraint in (6.21) is convex, and the optimization problem (6.21) can again be solved by the proximal gradient descent algorithm. That is, in view of Lemma 6.6, the only difficulty is the calculation of the generalized projection operator for the regularization term $h(\boldsymbol{\beta}) = \lambda \sum_{k=1}^K \|\boldsymbol{\beta}_k\|_2$. We therefore need to solve, for $\boldsymbol{z} = (z_0, \boldsymbol{z}_1, \dots, \boldsymbol{z}_K)$ with $\boldsymbol{z}_k \in \mathbb{R}^{q_k}$,

$$\begin{split} \operatorname{prox}_h(\boldsymbol{z}) &= \operatorname*{arg\,min}_{\boldsymbol{\beta} = (\beta_0, \boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_K)} \left\{ \frac{1}{2} \left\| \boldsymbol{z} - \boldsymbol{\beta} \right\|_2^2 + \lambda \sum_{k=1}^K \left\| \boldsymbol{\beta}_k \right\|_2 \right\} \\ &= \left( z_0, \left( \operatorname*{arg\,min}_{\boldsymbol{\beta}_k \in \mathbb{R}^{q_k}} \left\{ \frac{1}{2} \left\| \boldsymbol{z}_k - \boldsymbol{\beta}_k \right\|_2^2 + \lambda \left\| \boldsymbol{\beta}_k \right\|_2 \right\} \right)_{1 \le k \le K} \right). \end{split}$$

The latter highlights that the problem decouples into *K* independent problems. Thus, we need to solve for all 1 ≤ *k* ≤ *K* the optimization problems

$$\operatorname*{arg\,min}_{\boldsymbol{\beta}_k \in \mathbb{R}^{q_k}} \left\{ \frac{1}{2} \left\| \boldsymbol{z}_k - \boldsymbol{\beta}_k \right\|_2^2 + \lambda \left\| \boldsymbol{\beta}_k \right\|_2 \right\}.$$

**Lemma 6.9** *The group LASSO generalized soft-thresholding operator satisfies for* $\boldsymbol{z}_k \in \mathbb{R}^{q_k}$

$$\mathcal{S}^{q_k}_\lambda(\boldsymbol{z}_k) = \operatorname*{arg\,min}_{\boldsymbol{\beta}_k \in \mathbb{R}^{q_k}} \left\{ \frac{1}{2} \left\| \boldsymbol{z}_k - \boldsymbol{\beta}_k \right\|_2^2 + \lambda \left\| \boldsymbol{\beta}_k \right\|_2 \right\} = \boldsymbol{z}_k \left( 1 - \frac{\lambda}{\|\boldsymbol{z}_k\|_2} \right)_+ \in \mathbb{R}^{q_k},$$

*and for the generalized projection operator for* $h(\boldsymbol{\beta}) = \lambda \sum_{k=1}^K \|\boldsymbol{\beta}_k\|_2$ *we have*

$$\text{prox}\_{\boldsymbol{h}}(\mathbf{z}) = \mathcal{S}^{\text{group}}\_{\boldsymbol{\lambda}}(\mathbf{z}) \stackrel{\text{def.}}{=} \left( z\_0, \mathcal{S}^{q\_1}\_{\boldsymbol{\lambda}}(\mathbf{z}\_1), \dots, \mathcal{S}^{q\_K}\_{\boldsymbol{\lambda}}(\mathbf{z}\_K) \right),$$

*for* $\boldsymbol{z} = (z_0, \boldsymbol{z}_1, \dots, \boldsymbol{z}_K)$ *with* $\boldsymbol{z}_k \in \mathbb{R}^{q_k}$.

*Proof* In a first step we have

$$\operatorname*{arg\,min}_{\boldsymbol{\beta}_k \in \mathbb{R}^{q_k}} \left\{ \frac{1}{2} \left\| \boldsymbol{z}_k - \boldsymbol{\beta}_k \right\|_2^2 + \lambda \left\| \boldsymbol{\beta}_k \right\|_2 \right\} = \operatorname*{arg\,min}_{\boldsymbol{\beta}_k = \varrho \boldsymbol{z}_k / \|\boldsymbol{z}_k\|_2,\ \varrho \ge 0} \left\{ \frac{1}{2} \left\| \boldsymbol{z}_k \right\|_2^2 \left( 1 - \frac{\varrho}{\|\boldsymbol{z}_k\|_2} \right)^2 + \lambda \varrho \right\},$$

this follows because the squared distance $\|\boldsymbol{z}_k - \boldsymbol{\beta}_k\|_2^2 = \|\boldsymbol{z}_k\|_2^2 - 2\langle \boldsymbol{z}_k, \boldsymbol{\beta}_k \rangle + \|\boldsymbol{\beta}_k\|_2^2$ is minimized if $\boldsymbol{z}_k$ and $\boldsymbol{\beta}_k$ point in the same direction. Thus, there remains the minimization of the objective function in $\varrho \ge 0$. The first derivative is given by

$$\frac{\partial}{\partial \varrho} \left( \frac{1}{2} \| \mathbf{z}\_k \| \_2^2 \left( 1 - \frac{\varrho}{\| \mathbf{z}\_k \|\_2} \right)^2 + \lambda \varrho \right) = - \| \mathbf{z}\_k \|\_2 \left( 1 - \frac{\varrho}{\| \mathbf{z}\_k \|\_2} \right) + \lambda = \lambda - \| \mathbf{z}\_k \|\_2 + \varrho.$$

If $\|\boldsymbol{z}_k\|_2 > \lambda$ we have $\varrho = \|\boldsymbol{z}_k\|_2 - \lambda > 0$, and otherwise we need to set $\varrho = 0$. This implies

$$\mathcal{S}^{q_k}_\lambda(\boldsymbol{z}_k) = \left( \|\boldsymbol{z}_k\|_2 - \lambda \right)_+ \boldsymbol{z}_k / \|\boldsymbol{z}_k\|_2.$$

This completes the proof.
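The group soft-thresholding operator of Lemma 6.9 either scales the whole group towards zero or kills it entirely; a minimal numpy sketch (function name ours):

```python
import numpy as np

def group_soft_threshold(z_k, lam):
    """Group LASSO soft-thresholding operator of Lemma 6.9 for one group z_k."""
    z_k = np.asarray(z_k, dtype=float)
    norm = np.linalg.norm(z_k)
    if norm <= lam:
        return np.zeros_like(z_k)        # the whole group is set to zero jointly
    return z_k * (1.0 - lam / norm)      # joint shrinkage of the group
```

For $\boldsymbol{z}_k = (3, 4)^\top$ with $\|\boldsymbol{z}_k\|_2 = 5$ and $\lambda = 1$ the group is scaled by $0.8$; for $\lambda \ge 5$ it vanishes.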

**Fig. 6.9** Group LASSO regularized MLEs in model Gamma GLM1: (lhs) in-sample losses as a function of the regularization parameter $\lambda > 0$, (rhs) resulting $\widehat{\beta}_j^{\text{group}}(\lambda)$ for $1 \le j \le q$

Proximal gradient descent algorithm for group LASSO

1. Make the gradient descent step for a suitable learning rate $\varrho_{t+1} > 0$

$$
\boldsymbol{\beta}^{(t)} \mapsto \widetilde{\boldsymbol{\beta}}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \varrho_{t+1} \nabla_{\boldsymbol{\beta}} g(\boldsymbol{\beta}^{(t)}).
$$

2. Perform soft-thresholding of the gradient descent solution

$$
\widetilde{\boldsymbol{\beta}}^{(t+1)} \mapsto \boldsymbol{\beta}^{(t+1)} = \mathcal{S}^{\text{group}}_{\varrho_{t+1} \lambda} \left( \widetilde{\boldsymbol{\beta}}^{(t+1)} \right),
$$

where the latter soft-thresholding function is defined in Lemma 6.9.

3. Iterate these two steps until a stopping criterion is met.

*Example 6.10 (Group LASSO Regression)* We revisit Example 6.8 which considers claim size modeling using model Gamma GLM1. This time we group the variables OwnerAge and OwnerAge$^2$ ($\beta_1, \beta_2$) as well as VehAge and VehAge$^2$ ($\beta_5, \beta_6$). The results are shown in Fig. 6.9.

The order in which the parameters are regularized to zero is: $\beta_4$ (RiskClass), $\beta_8$ (BonusClass), $\beta_7$ (GenderMale), $(\beta_1, \beta_2)$ (OwnerAge, OwnerAge$^2$), $\beta_3$ (AreaGLM) and $(\beta_5, \beta_6)$ (VehAge, VehAge$^2$). This order now better reflects the variable importance received from the Wald statistics of Listing 5.11, and it shows that grouped features should be regularized jointly in order to determine their importance.

## **6.3 Expectation-Maximization Algorithm**

## *6.3.1 Mixture Distributions*

In many applied problems there does not exist a simple off-the-shelf distribution that is suitable to model the whole range of observations. We think of claim size modeling which may range from small to very large claims; the main body of the data may look, say, gamma distributed, while the tail of the data may be regularly varying. Another related problem is that claims may come from different insurance policy modules. For instance, in property insurance, one can insure water damage, fire, glass and theft claims on the same insurance policy, and feature information about the claim type may not always be available. In such cases, it is attractive to choose a mixture or a composition of different distributions. In this section we focus on mixtures.

Choose a fixed integer *K* bigger than 1 and define the *(K* − 1*)*-unit simplex excluding the edges by

$$\Delta\_K = \left\{ p \in (0,1)^K \; ; \; \sum\_{k=1}^K p\_k = 1 \right\}.\tag{6.22}$$

$\Delta_K$ defines the family of categorical distributions with $K$ levels (all levels having a strictly positive probability). These distributions belong to the vector-valued parameter EF which we have met in Sects. 2.1.4 and 5.7.

The idea behind mixture distributions is to mix $K$ different distributions with a mixture probability $\boldsymbol{p} \in \Delta_K$. For instance, we can mix $K$ different EDF densities $f_k$ by considering

$$Y \sim \sum_{k=1}^K p_k f_k(y; \theta_k, v/\varphi_k) = \sum_{k=1}^K p_k \exp\left\{ \frac{y\theta_k - \kappa_k(\theta_k)}{\varphi_k/v} + a_k(y; v/\varphi_k) \right\},\tag{6.23}$$

with cumulant functions $\theta_k \in \boldsymbol{\Theta}_k \mapsto \kappa_k(\theta_k)$, exposure $v > 0$ and dispersion parameters $\varphi_k > 0$, for $1 \le k \le K$.

At first sight, this does not look very spectacular, and parameter estimation seems straightforward. If we consider the log-likelihood of $n$ independent random variables $\boldsymbol{Y} = (Y_1, \dots, Y_n)$ following the mixture density (6.23), we receive the log-likelihood function

$$(\boldsymbol{\theta}, \boldsymbol{p}) \mapsto \ell_{\boldsymbol{Y}}(\boldsymbol{\theta}, \boldsymbol{p}) = \sum_{i=1}^n \ell_{Y_i}(\boldsymbol{\theta}, \boldsymbol{p}) = \sum_{i=1}^n \log\left( \sum_{k=1}^K p_k f_k(Y_i; \theta_k, v_i/\varphi_k) \right), \tag{6.24}$$

for canonical parameter $\boldsymbol{\theta} = (\theta_1, \dots, \theta_K)^\top \in \boldsymbol{\Theta} = \boldsymbol{\Theta}_1 \times \dots \times \boldsymbol{\Theta}_K$ and mixture probability $\boldsymbol{p} \in \Delta_K$. Unfortunately, MLE of $(\boldsymbol{\theta}, \boldsymbol{p})$ in (6.24) is not that simple. Note that the summation over $1 \le k \le K$ is inside of the logarithmic function, and the use of the Newton–Raphson algorithm may be cumbersome. The Expectation-Maximization (EM) algorithm presented in Sect. 6.3.3, below, makes parameter estimation feasible. In a nutshell, the EM algorithm leads to a sequence of parameter estimates for $(\boldsymbol{\theta}, \boldsymbol{p})$ that monotonically increases the log-likelihood in each iteration of the algorithm. Thus, we can receive an approximation to the MLE of $(\boldsymbol{\theta}, \boldsymbol{p})$.
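The log-of-sum structure of (6.24) can be made concrete with a small numpy sketch; here Gaussian components stand in for the generic EDF densities $f_k$ (unit dispersion and exposures; the function name and setup are ours):

```python
import numpy as np

def mixture_loglik(y, p, mu, sigma):
    """Incomplete log-likelihood (6.24) for a K-component Gaussian mixture.

    The sum over the K components sits INSIDE the logarithm, which is
    exactly what makes direct maximum likelihood estimation awkward.
    """
    y = np.atleast_1d(np.asarray(y, dtype=float))
    # component densities f_k(y_i; theta_k), array of shape (n, K)
    dens = np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) \
        / (np.sqrt(2.0 * np.pi) * sigma)
    return float(np.sum(np.log(dens @ p)))
```

Because the logarithm does not distribute over the inner sum, the score equations couple all components, in contrast to the complete log-likelihood (6.26) discussed below.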

Nevertheless, model fitting may still be difficult for the following reasons. Firstly, the log-likelihood function of a mixture distribution need not be bounded; we highlight this in Example 6.13, below. In that case, MLE is not a well-defined problem. Secondly, even in very simple situations, the log-likelihood function (6.24) can have multiple local maxima. This usually happens if the data is clustered and the clusters are well separated. In that case of multiple local maxima, convergence of the EM algorithm does not guarantee that we have found the global maximum. Thirdly, convergence of the log-likelihood function through the EM algorithm does not guarantee that the sequence of parameter estimates of $(\boldsymbol{\theta}, \boldsymbol{p})$ converges, too. The latter needs additional examination and regularity conditions.

Figure 6.10 (lhs) shows the density of a mixture distribution mixing $K = 3$ gamma densities with shape parameters $\alpha_k = 1, 20, 40$ (orange, green and blue) and mixture probability $\boldsymbol{p} = (0.7, 0.1, 0.2)^\top$; the mixture components are already multiplied with $\boldsymbol{p}$. The resulting mixture density in red color is continuous. Figure 6.10 (rhs) replaces the blue gamma component of the plot on the left-hand side by a Pareto component (in blue). As a result we observe that the resulting mixture density in red is no longer continuous. This example is often used in practice; however, the discontinuity may be a serious issue in applications, and one may use a Lomax (Pareto Type II) component instead, we refer to Sect. 2.2.5.

**Fig. 6.10** (lhs) Mixture distribution mixing three gamma densities, and (rhs) mixture distribution mixing two gamma components and a Pareto component, with mixture probabilities $\boldsymbol{p} = (0.7, 0.1, 0.2)^\top$ for the orange, green and blue components (the density components are already multiplied with $\boldsymbol{p}$)

## *6.3.2 Incomplete and Complete Log-Likelihoods*

A mixture distribution can be defined (brute force) by just defining a mixture density as in (6.23). Alternatively, we could define a mixture distribution in a more constructive way. In the following we discuss this constructive derivation which will allow us to efficiently fit mixture distributions to data *Y*. For our outline we focus on (6.23), but all results presented below hold true in much more generality.

Choose a categorical random variable $Z$ with $K \ge 2$ levels having probabilities $\mathbb{P}[Z = k] = p_k > 0$ for $1 \le k \le K$, that is, with $\boldsymbol{p} \in \Delta_K$. The main idea is to sample in a first step a level $Z = k \in \{1, \dots, K\}$, and in a second step $Y|_{\{Z=k\}} \sim f_k(y; \theta_k, v/\varphi_k)$, based on the selected level $Z = k$. The random tuple $(Y, Z)$ has joint density

$$(Y, Z) \sim f\_{\theta, p}(\mathbf{y}, k) = p\_k f\_k(\mathbf{y}; \theta\_k, \upsilon/\varphi\_k),$$

and the marginal density of *Y* is exactly given by (6.23). In this interpretation we have a hierarchical model *(Y, Z)*. If only *Y* is available for parameter estimation, then we are in the situation of *incomplete information* because information about the first hierarchy *Z* is missing. If both *Y* and *Z* are available we say that we have *complete information*.

For the subsequent derivations we use a different coding of the categorical random variable *Z*, namely, *Z* can be represented in the following one-hot encoding version

$$\boldsymbol{Z} = (Z_1, \dots, Z_K)^\top = \left( \mathbb{1}_{\{Z=1\}}, \dots, \mathbb{1}_{\{Z=K\}} \right)^\top,\tag{6.25}$$

these are the $K$ corners of the $(K-1)$-unit simplex $\Delta_K$. One-hot encoding differs from dummy coding (5.21). One-hot encoding does not lead to a full rank design matrix because there is a redundancy: we can drop one component of $\boldsymbol{Z}$ and still have the same information. The one-hot encoding $\boldsymbol{Z}$ of $Z$ allows us to extend the *incomplete (data) log-likelihood* $\ell_Y(\boldsymbol{\theta}, \boldsymbol{p})$, see (6.23)–(6.24), under complete information $(Y, \boldsymbol{Z})$ as follows

$$\begin{split} \ell_{(Y,\boldsymbol{Z})}(\boldsymbol{\theta}, \boldsymbol{p}) &= \log \left( \prod_{k=1}^K \left( p_k f_k(Y; \theta_k, v/\varphi_k) \right)^{Z_k} \right) \\ &= \log \left( \prod_{k=1}^K \left( p_k \exp\left\{ \frac{Y\theta_k - \kappa_k(\theta_k)}{\varphi_k/v} + a_k(Y; v/\varphi_k) \right\} \right)^{Z_k} \right) \\ &= \sum_{k=1}^K Z_k \left( \log(p_k) + \frac{Y\theta_k - \kappa_k(\theta_k)}{\varphi_k/v} + a_k(Y; v/\varphi_k) \right). \end{split} \tag{6.26}$$

$\ell_{(Y,\boldsymbol{Z})}(\boldsymbol{\theta}, \boldsymbol{p})$ is called *complete (data) log-likelihood*. As a consequence of this last expression we observe that under complete information $(Y_i, \boldsymbol{Z}_i)_{1 \le i \le n}$, the MLE of $\boldsymbol{\theta}$ and $\boldsymbol{p}$ can be determined completely analogously to above. Namely, $\theta_k$ is estimated from all observations $Y_i$ for which $\boldsymbol{Z}_i$ belongs to level $k$, and the level indicators $(\boldsymbol{Z}_i)_{1 \le i \le n}$ are used to estimate the mixture probability $\boldsymbol{p}$. Thus, the objective function nicely decouples under complete information into independent parts for $\theta_k$ and $\boldsymbol{p}$ estimation. There remains the question of how to fit this model under incomplete information $\boldsymbol{Y}$. The next section discusses this problem.

## *6.3.3 Expectation-Maximization Algorithm for Mixtures*

The EM algorithm is a general purpose tool for parameter estimation under incomplete information. The EM algorithm has been introduced within the EF by Sundberg [348, 349]. Sundberg's developments have been based on the vector-valued parameter EF with statistics $S(\boldsymbol{Y}) \in \mathbb{R}^k$, see (3.17), and he solved the estimation problem under the assumption that $S(\boldsymbol{Y})$ is not fully known. These results have been generalized to MLE under incomplete data in the celebrated work of Dempster et al. [96] and Wu [385]. The monograph of McLachlan–Krishnan [267] gives the theory behind the EM algorithm, and it also provides a historical review in Section 1.8. In actuarial science the EM algorithm is increasingly used to solve various kinds of problems of incomplete data. Mixture models of Erlang distributions are considered in Lee–Lin [240], Yin–Lin [396] and Fung et al. [146, 147]; general Erlang mixtures are universal approximators to positive distributions (in the weak convergence sense), and regularized Erlang mixtures and mixtures of experts models are determined using the EM algorithm to receive approximations to the true underlying model. Miljkovic–Grün [278], Parodi [295] and Fung et al. [148] consider the EM algorithm for mixtures of general distributions, in particular, mixtures of small and large claims distributions. Verbelen et al. [371], Blostein–Miljkovic [40], Grün–Miljkovic [173] and Fung et al. [147] use the EM algorithm for censored and/or truncated observations, and dispersion modeling is performed with the EM algorithm in Tzougas–Karlis [359]. (Inhomogeneous) phase-type and matrix Mittag–Leffler distributions are fitted with the EM algorithm in Asmussen et al. [14], Albrecher et al. [8] and Bladt [37], and the EM algorithm is used to fit mixture density networks (MDNs) in Delong et al. [95]. Parameter uncertainty is investigated in O'Hagan et al. [289] using the bootstrap method. The present section is mainly based on McLachlan–Krishnan [267].

As mentioned above, the EM algorithm is a general purpose tool for parameter estimation under incomplete data, and we describe the variant of the EM algorithm which is useful for our mixture distribution setup given in (6.26). We give a justification for its functioning below. The EM algorithm is an iterative algorithm that performs a Bayesian expectation step (E-step) to infer the latent variable $\boldsymbol{Z}$, given the model parameters and $Y$. Next, it performs a maximization step (M-step) for MLE of the parameters, given the observation $Y$ and the estimated latent variable $\widehat{\boldsymbol{Z}}$. More specifically, the E-step and the M-step look as follows.

• **E-step.** Calculate the posterior probability of the event that a given observation *Y* has been generated from the *k*-th component of the mixture distribution. Bayes' rule allows us to infer this posterior probability (for given *θ* and *p*) from (6.26)

$$\mathbb{P}\_{\theta, p}[Z\_k = 1 | Y] = \frac{p\_k f\_k(Y; \theta\_k, v/\varphi\_k)}{\sum\_{l=1}^K p\_l f\_l(Y; \theta\_l, v/\varphi\_l)}.$$

The posterior (Bayesian) estimate for *Zk* after having observed *Y* is given by

$$\widehat{Z}_k(\boldsymbol{\theta}, \boldsymbol{p}|Y) \stackrel{\text{def.}}{=} \mathbb{E}_{\boldsymbol{\theta}, \boldsymbol{p}}[Z_k|Y] = \mathbb{P}_{\boldsymbol{\theta}, \boldsymbol{p}}[Z_k = 1|Y] \qquad \text{for } 1 \le k \le K. \tag{6.27}$$

This posterior mean $\widehat{\boldsymbol{Z}} = \widehat{\boldsymbol{Z}}(\boldsymbol{\theta}, \boldsymbol{p}|Y) = (\widehat{Z}_1(\boldsymbol{\theta}, \boldsymbol{p}|Y), \dots, \widehat{Z}_K(\boldsymbol{\theta}, \boldsymbol{p}|Y))^\top \in \Delta_K$ is used as an estimate for the (unobserved) latent variable $\boldsymbol{Z}$; note that this posterior mean depends on the unknown parameters $(\boldsymbol{\theta}, \boldsymbol{p})$.

• **M-step.** Based on $Y$ and $\widehat{\boldsymbol{Z}}$ the parameters $\boldsymbol{\theta}$ and $\boldsymbol{p}$ are estimated with MLE.

Alternating these two steps provides the following recursive algorithm. We assume to have independent responses $(Y_i, \boldsymbol{Z}_i)$, $1 \le i \le n$, following the mixture distribution (6.26), where, for simplicity, we assume that only the volumes $v_i > 0$ depend on $i$.

EM algorithm for mixture distributions

• **E-step.** Given the parameter $(\widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)}) \in \boldsymbol{\Theta} \times \Delta_K$, estimate the latent variables $\boldsymbol{Z}_i$, $1 \le i \le n$, by their conditional expectations, see (6.27),

$$\widehat{\boldsymbol{Z}}_i^{(t)} = \widehat{\boldsymbol{Z}}\left( \widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)} \Big| Y_i \right) = \mathbb{E}_{\widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)}}[\boldsymbol{Z}_i | Y_i] \in \Delta_K. \tag{6.28}$$

• **M-step.** Calculate the MLE $(\widehat{\boldsymbol{\theta}}^{(t)}, \widehat{\boldsymbol{p}}^{(t)}) \in \boldsymbol{\Theta} \times \Delta_K$ based on the (complete) observations $((Y_1, \widehat{\boldsymbol{Z}}_1^{(t)}), \dots, (Y_n, \widehat{\boldsymbol{Z}}_n^{(t)}))$, i.e., solve the score equations, see (6.26),

$$\nabla_{\boldsymbol{\theta}} \left( \sum_{i=1}^n \sum_{k=1}^K \widehat{Z}_{i,k}^{(t)}\, \frac{Y_i \theta_k - \kappa_k(\theta_k)}{\varphi_k / v_i} \right) = 0,\tag{6.29}$$

$$\nabla_{\boldsymbol{p}_-} \left( \sum_{i=1}^n \sum_{k=1}^K \widehat{Z}_{i,k}^{(t)} \log(p_k) \right) = 0,\tag{6.30}$$

where $\boldsymbol{p}_- = (p_1, \dots, p_{K-1})^\top$ and setting $p_K = 1 - \sum_{k=1}^{K-1} p_k \in (0,1)$.
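The alternation of E-step (6.28) and M-step can be sketched for the null model as follows; again Gaussian components stand in for the generic EDF densities, with unit exposures $v_i = 1$, and the function name, initialization and toy data are ours:

```python
import numpy as np

def em_gaussian_mixture(y, K, n_iter=200):
    """EM algorithm sketch for a K-component Gaussian mixture (null model).

    E-step: posterior membership probabilities Z_hat via Bayes' rule;
    M-step: mixture weights by (6.31), component means/sds as weighted MLEs.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    p = np.full(K, 1.0 / K)                        # initial mixture weights
    mu = np.percentile(y, np.linspace(10, 90, K))  # spread-out initial means
    sigma = np.full(K, y.std())
    for _ in range(n_iter):
        # E-step: posterior probabilities Z_hat_{i,k}, shape (n, K)
        dens = np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) \
            / (np.sqrt(2.0 * np.pi) * sigma)
        Z = dens * p
        Z /= Z.sum(axis=1, keepdims=True)
        # M-step: weighted MLEs given the estimated memberships
        w = Z.sum(axis=0)
        p = w / n                                   # null-model update (6.31)
        mu = (Z * y[:, None]).sum(axis=0) / w
        sigma = np.sqrt((Z * (y[:, None] - mu) ** 2).sum(axis=0) / w)
    return p, mu, sigma

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
p_hat, mu_hat, sigma_hat = em_gaussian_mixture(y, K=2)
```

On this well-separated two-cluster toy data the algorithm recovers means near $-3$ and $3$ and mixture weights near $(0.3, 0.7)$; each iteration does not decrease the incomplete log-likelihood (6.24), as justified below.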

#### *Remarks 6.11*


• If we calculate the scores element-wise we receive

$$\begin{aligned} \frac{\partial}{\partial \theta_k} \sum_{i=1}^n \widehat{Z}_{i,k}^{(t)}\, \frac{Y_i \theta_k - \kappa_k(\theta_k)}{\varphi_k / v_i} &= 0, \\ \frac{\partial}{\partial p_k} \sum_{i=1}^n \left( \widehat{Z}_{i,k}^{(t)} \log(p_k) + \widehat{Z}_{i,K}^{(t)} \log(p_K) \right) &= 0, \end{aligned}$$

recall the normalization $p_K = 1 - \sum_{k=1}^{K-1} p_k \in (0,1)$.

From the first score equation we see that we receive the classical MLE/GLM framework, and all tools introduced above for parameter estimation can directly be used. The only part that changes are the weights $v_i \mapsto v_i \widehat{Z}_{i,k}^{(t)}$. In the homogeneous case, i.e., in the null model, we have the MLE after the $t$-th iteration of the EM algorithm

$$
\widehat{\theta}\_k^{(t)} = h\_k \left( \frac{\sum\_{i=1}^n v\_i \widehat{Z}\_{i,k}^{(t)} Y\_i}{\sum\_{i=1}^n v\_i \widehat{Z}\_{i,k}^{(t)}} \right),
$$

where $h\_k$ is the canonical link that corresponds to the cumulant function $\kappa\_k$.

If we choose the null model for the mixture probabilities we receive the MLEs

$$
\widehat{p}\_k^{(t)} = \frac{1}{n} \sum\_{i=1}^{n} \widehat{Z}\_{i,k}^{(t)} \qquad \text{for } 1 \le k \le K. \tag{6.31}
$$

In Sect. 6.3.4, below, we present an example that uses the null model for the mixture probabilities $\boldsymbol{p}$, and we present another example that uses a logistic categorical GLM for these mixture probabilities.

**Justification of the EM Algorithm** So far, we have neither given any argument why the EM algorithm is reasonable for parameter estimation nor have we said anything about convergence. The purpose of this paragraph is to justify the above EM algorithm. We aim at solving the incomplete log-likelihood maximization problem, see (6.24),

$$(\widehat{\boldsymbol{\theta}}^{\text{MLE}}, \widehat{\boldsymbol{p}}^{\text{MLE}}) \;= \underset{(\boldsymbol{\theta}, \boldsymbol{p})}{\text{arg}\max} \,\ell\_{\boldsymbol{Y}}(\boldsymbol{\theta}, \boldsymbol{p}) \;= \underset{(\boldsymbol{\theta}, \boldsymbol{p})}{\text{arg}\max} \sum\_{i=1}^{n} \log \left( \sum\_{k=1}^{K} p\_{k} f\_{k}(Y\_{i}; \boldsymbol{\theta}\_{k}, v\_{i}/\varphi\_{k}) \right),$$

subject to existence and uniqueness. We introduce some notation. Let $f(y, z; \boldsymbol{\theta}, \boldsymbol{p}) = \exp\{\ell\_{(y,z)}(\boldsymbol{\theta}, \boldsymbol{p})\}$ be the joint density of $(Y, \boldsymbol{Z})$, and let $f(y; \boldsymbol{\theta}, \boldsymbol{p}) = \exp\{\ell\_{y}(\boldsymbol{\theta}, \boldsymbol{p})\}$ be the marginal density of $Y$. This allows us to rewrite the incomplete log-likelihood as follows, for any value of $z$,

$$\ell\_Y(\theta, p) = \log f(Y; \theta, p) = \log \left( \frac{f(Y, z; \theta, p)}{f(z|Y; \theta, p)} \right),$$

thus, we bring in the complete log-likelihood by using Bayes' rule. Choose an arbitrary categorical distribution $\pi \in \Delta\_K$ with $K$ levels. Using the previous step, we have

$$\ell\_Y(\boldsymbol{\theta}, \boldsymbol{p}) = \log f(Y; \boldsymbol{\theta}, \boldsymbol{p}) = \sum\_z \pi(z) \log f(Y; \boldsymbol{\theta}, \boldsymbol{p})$$

$$= \sum\_z \pi(z) \log \left( \frac{f(Y, z; \boldsymbol{\theta}, \boldsymbol{p})/\pi(z)}{f(z|Y; \boldsymbol{\theta}, \boldsymbol{p})/\pi(z)} \right)$$

$$= \sum\_z \pi(z) \log \left( \frac{f(Y, z; \boldsymbol{\theta}, \boldsymbol{p})}{\pi(z)} \right) + \sum\_z \pi(z) \log \left( \frac{\pi(z)}{f(z|Y; \boldsymbol{\theta}, \boldsymbol{p})} \right)$$

$$= \sum\_z \pi(z) \log \left( \frac{f(Y, z; \boldsymbol{\theta}, \boldsymbol{p})}{\pi(z)} \right) + D\_{\text{KL}}(\pi || f(\cdot|Y; \boldsymbol{\theta}, \boldsymbol{p})) \tag{6.32}$$

$$\ge \sum\_z \pi(z) \log \left( \frac{f(Y, z; \boldsymbol{\theta}, \boldsymbol{p})}{\pi(z)} \right),$$

the inequality follows because the KL divergence is always non-negative, see Lemma 2.21. This provides us with a lower bound for the incomplete log-likelihood $\ell\_Y(\boldsymbol{\theta}, \boldsymbol{p})$ for any categorical distribution $\pi \in \Delta\_K$ and any $(\boldsymbol{\theta}, \boldsymbol{p}) \in \boldsymbol{\Theta} \times \Delta\_K$:

$$\ell\_Y(\boldsymbol{\theta}, \boldsymbol{p}) \ge \sum\_z \pi(z) \log \left( \frac{f(Y, z; \boldsymbol{\theta}, \boldsymbol{p})}{\pi(z)} \right) \tag{6.33}$$

$$= \mathbb{E}\_{\boldsymbol{Z} \sim \pi} \left[ \ell\_{(Y, \boldsymbol{Z})}(\boldsymbol{\theta}, \boldsymbol{p}) \Big|\, Y \right] - \sum\_z \pi(z) \log(\pi(z)) \stackrel{\text{def.}}{=} \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{p}; \pi).$$

Thus, we have a lower bound $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{p}; \pi)$ on the incomplete log-likelihood $\ell\_Y(\boldsymbol{\theta}, \boldsymbol{p})$. This lower bound is based on the conditionally expected complete log-likelihood $\ell\_{(Y,\boldsymbol{Z})}(\boldsymbol{\theta}, \boldsymbol{p})$, given $Y$, and under an arbitrary choice $\pi$ for $\boldsymbol{Z}$. The difference between this arbitrary $\pi$ and the true conditional posterior distribution is given by the KL divergence $D\_{\text{KL}}(\pi || f(\cdot|Y; \boldsymbol{\theta}, \boldsymbol{p}))$, see (6.32).
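This lower bound can be checked numerically. The following Python sketch (our own illustration, with hypothetical names) evaluates the incomplete log-likelihood and the bound $\mathcal{Q}$ for a single observation of a two-component Gaussian mixture, verifying that an arbitrary $\pi$ gives a lower bound and that the posterior attains equality.

```python
import math

def gauss(y, mu, s):
    return math.exp(-0.5 * ((y - mu) / s) ** 2) / (math.sqrt(2 * math.pi) * s)

y = 1.3
p, mus, sig = [0.3, 0.7], [0.0, 2.0], [1.0, 0.5]
joint = [p[k] * gauss(y, mus[k], sig[k]) for k in range(2)]  # f(Y, z=k)
ell_Y = math.log(sum(joint))                                 # incomplete log-likelihood

def Q(pi):
    # lower bound (6.33): sum_z pi(z) log( f(Y,z) / pi(z) )
    return sum(pi[k] * math.log(joint[k] / pi[k]) for k in range(2))

posterior = [j / sum(joint) for j in joint]   # f(z|Y), the optimal choice of pi
assert Q([0.5, 0.5]) <= ell_Y + 1e-12         # any pi gives a lower bound
assert abs(Q(posterior) - ell_Y) < 1e-12      # equality at the posterior
```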

The general idea of the EM algorithm is to make this lower bound *Q(θ, p*; *π )* as large as possible in *θ*, *p* and *π* by iterating the following two alternating steps for *t* ≥ 1:

$$\widehat{\pi}^{(t)} = \underset{\pi}{\text{arg}\, \text{max}}\, \mathbb{Q}\left(\widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)}; \pi\right),\tag{6.34}$$

$$(\widehat{\boldsymbol{\theta}}^{(t)}, \widehat{\boldsymbol{p}}^{(t)}) = \underset{\boldsymbol{\theta}, \boldsymbol{p}}{\text{arg}\max} \, \mathcal{Q}\left(\boldsymbol{\theta}, \, \boldsymbol{p}; \, \widehat{\boldsymbol{\pi}}^{(t)}\right). \tag{6.35}$$

The first step (6.34) can be solved explicitly, and it results in the E-step. Namely, from (6.32) we see that maximizing $\mathcal{Q}(\widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)}; \pi)$ in $\pi$ is equivalent to minimizing the KL divergence $D\_{\text{KL}}(\pi || f(\cdot|Y; \widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)}))$ in $\pi$, because the left-hand side of (6.32) is independent of $\pi$. Thus, we have to solve

$$\widehat{\pi}^{(t)} = \operatorname\*{arg\,max}\_{\pi} \mathcal{Q}\left(\widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)}; \pi\right) = \operatorname\*{arg\,min}\_{\pi} D\_{\text{KL}}\left(\pi \,\middle\|\, f(\cdot|Y; \widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)})\right).$$

This optimization is solved by choosing the density $\widehat{\pi}^{(t)} = f(\cdot|Y; \widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)})$, see Lemma 2.21, and this gives us exactly (6.28) if we calculate the corresponding conditional expectation of the latent variable $\boldsymbol{Z}$. Moreover, importantly, this step provides us with an identity in (6.33):

$$\ell\_Y(\widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)}) = \mathcal{Q}\left(\widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)}; \widehat{\boldsymbol{\pi}}^{(t)}\right). \tag{6.36}$$

The second step (6.35) then increases the right-hand side of (6.36). This second step is equivalent to

$$(\widehat{\boldsymbol{\theta}}^{(t)}, \widehat{\boldsymbol{p}}^{(t)}) = \mathop{\arg\max}\_{\boldsymbol{\theta}, \boldsymbol{p}} \mathcal{Q} \left( \boldsymbol{\theta}, \boldsymbol{p}; \widehat{\pi}^{(t)} \right) = \mathop{\arg\max}\_{\boldsymbol{\theta}, \boldsymbol{p}} \mathbb{E}\_{\boldsymbol{Z} \sim \widehat{\pi}^{(t)}} \left[ \ell\_{(Y, \boldsymbol{Z})} (\boldsymbol{\theta}, \boldsymbol{p}) \, \big|\, Y \right], \tag{6.37}$$

and this maximization is solved by the solution of the score equations (6.29)–(6.30) of the M-step. In this step we explicitly use the linearity of the log-likelihood $\ell\_{(Y,\boldsymbol{Z})}$ in $\boldsymbol{Z}$, which allows us to calculate the objective function in (6.37) explicitly, resulting in replacing $\boldsymbol{Z}$ by $\widehat{\boldsymbol{Z}}^{(t)}$. For other incomplete data problems, where we do not have this linearity, this step will be more complicated.

Summarizing, alternating the optimizations (6.34) and (6.35) gives us a sequence of parameters $(\widehat{\boldsymbol{\theta}}^{(t)}, \widehat{\boldsymbol{p}}^{(t)})\_{t \ge 0}$ with monotonically increasing incomplete log-likelihoods

$$\cdots \le \ell\_Y(\widehat{\boldsymbol{\theta}}^{(t-1)}, \widehat{\boldsymbol{p}}^{(t-1)}) \le \ell\_Y(\widehat{\boldsymbol{\theta}}^{(t)}, \widehat{\boldsymbol{p}}^{(t)}) \le \ell\_Y(\widehat{\boldsymbol{\theta}}^{(t+1)}, \widehat{\boldsymbol{p}}^{(t+1)}) \le \cdots \tag{6.38}$$

Therefore, the EM algorithm converges, provided that the incomplete log-likelihood $\ell\_Y(\boldsymbol{\theta}, \boldsymbol{p})$ is a bounded function.
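The monotonicity (6.38) can be observed numerically. The following Python sketch (our own illustration with synthetic data) runs the EM algorithm on a two-component Gaussian mixture null model with known unit variances (fixing the variances keeps the log-likelihood bounded); the M-step uses the weighted means and the probability update (6.31), and the incomplete log-likelihood is tracked across iterations.

```python
import math, random

def gauss(y, mu, s):
    return math.exp(-0.5 * ((y - mu) / s) ** 2) / (math.sqrt(2 * math.pi) * s)

def incomplete_ll(ys, mus, p):
    return sum(math.log(p[0] * gauss(y, mus[0], 1.0) + p[1] * gauss(y, mus[1], 1.0))
               for y in ys)

random.seed(1)
ys = ([random.gauss(0.0, 1.0) for _ in range(60)]
      + [random.gauss(4.0, 1.0) for _ in range(40)])

mus, p = [-1.0, 1.0], [0.5, 0.5]            # initial parameters
lls = [incomplete_ll(ys, mus, p)]
for _ in range(50):
    # E-step (6.28): responsibilities for each observation
    Z = []
    for y in ys:
        w = [p[k] * gauss(y, mus[k], 1.0) for k in range(2)]
        t = sum(w)
        Z.append([wk / t for wk in w])
    # M-step: weighted means (canonical link is the identity) and (6.31)
    mus = [sum(z[k] * y for z, y in zip(Z, ys)) / sum(z[k] for z in Z) for k in range(2)]
    p = [sum(z[k] for z in Z) / len(ys) for k in range(2)]
    lls.append(incomplete_ll(ys, mus, p))

assert all(b >= a - 1e-7 for a, b in zip(lls, lls[1:]))   # monotonicity (6.38)
assert abs(p[0] + p[1] - 1.0) < 1e-10
```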

#### *Remarks 6.12*


## *6.3.4 Lab: Mixture Distribution Applications*

In this section we present different mixture distribution examples that use the EM algorithm for parameter estimation. On the one hand, this illustrates the functioning of the EM algorithm; on the other hand, it also highlights pitfalls that need to be avoided.

*Example 6.13 (Gaussian Mixture)* We directly fit a mixture model to the observations $\boldsymbol{Y} = (Y\_1, \ldots, Y\_n)^\top$. Assume that the log-likelihood of $\boldsymbol{Y}$ is given by a mixture of two Gaussian distributions

$$\ell\_Y(\theta, \sigma, p) = \sum\_{i=1}^n \log \left( \sum\_{k=1}^2 p\_k \frac{1}{\sqrt{2\pi}\sigma\_k} \exp \left\{ -\frac{1}{2\sigma\_k^2} (Y\_i - \theta\_k)^2 \right\} \right),$$

with $\boldsymbol{p} \in \Delta\_2$, mean vector $\boldsymbol{\theta} = (\theta\_1, \theta\_2)^\top \in \mathbb{R}^2$ and standard deviations $\boldsymbol{\sigma} = (\sigma\_1, \sigma\_2)^\top \in \mathbb{R}\_+^2$. Choose the estimate $\widehat{\theta}\_1 = Y\_1$; then we have

$$\lim\_{\sigma\_1 \to 0} \frac{1}{\sqrt{2\pi}\sigma\_1} \exp\left\{ -\frac{1}{2\sigma\_1^2} (Y\_1 - \widehat{\theta}\_1)^2 \right\} = \lim\_{\sigma\_1 \to 0} \frac{1}{\sqrt{2\pi}\sigma\_1} = \infty.$$

For any $i \ne 1$ we have $Y\_i \ne \widehat{\theta}\_1$ (note that the Gaussian distribution is absolutely continuous and the observations are distinct, a.s.). Henceforth, for $i \ne 1$,

$$\lim\_{\sigma\_1 \to 0} \frac{1}{\sqrt{2\pi}\sigma\_1} \exp\left\{ -\frac{1}{2\sigma\_1^2} (Y\_i - \widehat{\theta}\_1)^2 \right\} = \lim\_{\sigma\_1 \to 0} \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma\_1^2} (Y\_i - \widehat{\theta}\_1)^2 - \log \sigma\_1 \right\} = 0.$$

If we choose any $\widehat{\theta}\_2 \in \mathbb{R}$, $\boldsymbol{p} \in \Delta\_2$ and $\sigma\_2 > 0$, we receive for $\widehat{\theta}\_1 = Y\_1$

$$\begin{split} \lim\_{\sigma\_{1}\to 0} \ell\_{\boldsymbol{Y}}(\widehat{\boldsymbol{\theta}}, \boldsymbol{\sigma}, \boldsymbol{p}) &= \lim\_{\sigma\_{1}\to 0} \log \left( \sum\_{k=1}^{2} p\_{k} \frac{1}{\sqrt{2\pi}\sigma\_{k}} \exp \left\{ -\frac{1}{2\sigma\_{k}^{2}} (Y\_{1} - \widehat{\theta}\_{k})^{2} \right\} \right) \\ &\quad + \sum\_{i=2}^{n} \left( \log \left( \frac{p\_{2}}{\sqrt{2\pi}\sigma\_{2}} \right) - \frac{1}{2\sigma\_{2}^{2}} (Y\_{i} - \widehat{\theta}\_{2})^{2} \right) = \infty. \end{split}$$

Thus, we can make the log-likelihood of this mixture Gaussian model arbitrarily large by fitting a degenerate Gaussian model to one observation in one mixture component, and letting the remaining observations be described by the other mixture component. This shows that the MLE problem may not be well-posed for mixture distributions because the log-likelihood can be unbounded.

If the data has well-separated clusters, the log-likelihood of a mixture Gaussian distribution will have multiple local maxima. For any given number $B \in \mathbb{N}$ one can construct a data set $\boldsymbol{Y}$ such that the number of local maxima exceeds this number $B$, see Theorem 3 in Améndola et al. [11].
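The degeneracy of Example 6.13 is easy to reproduce numerically. The following Python sketch (our own illustration, with synthetic observations) centers the first component at $Y\_1$ and lets $\sigma\_1 \to 0$; the incomplete log-likelihood increases beyond any bound.

```python
import math

def gauss_log(y, mu, s):
    """Log-density of N(mu, s^2) at y."""
    return -0.5 * ((y - mu) / s) ** 2 - math.log(math.sqrt(2 * math.pi) * s)

def mixture_ll(ys, mus, sigmas, p):
    """Incomplete log-likelihood of a two-component Gaussian mixture."""
    return sum(math.log(sum(p[k] * math.exp(gauss_log(y, mus[k], sigmas[k]))
                            for k in range(2))) for y in ys)

ys = [0.7, 1.9, 2.4, 3.1, 3.6]
theta1 = ys[0]                        # degenerate component centred at Y_1
prev = -float("inf")
for s1 in (1.0, 0.1, 0.01, 0.001):    # sigma_1 -> 0
    ll = mixture_ll(ys, [theta1, 2.5], [s1, 1.0], [0.5, 0.5])
    assert ll > prev                  # the log-likelihood increases without bound
    prev = ll
```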

*Example 6.14 (Gamma Claim Size Modeling)* In this example we consider claim size modeling of the French MTPL example given in Chap. 13.1. In view of Fig. 13.15 this seems quite difficult because we have three modes and heavy-tailedness. We choose a mixture of $K = 5$ distribution functions: four gamma distributions and one Lomax distribution

$$Y \sim \sum\_{k=1}^{4} p\_k \frac{\beta\_k^{\alpha\_k}}{\Gamma(\alpha\_k)} y^{\alpha\_k - 1} \exp\{-\beta\_k y\} + p\_5 \, \frac{\beta\_5}{M} \left( \frac{y + M}{M} \right)^{-(\beta\_5 + 1)}, \tag{6.39}$$

with shape parameters $\alpha\_k$ and scale parameters $\beta\_k$, $1 \le k \le 4$, for the gamma densities; scale parameter $M$ and tail parameter $\beta\_5$ for the Lomax density; and with mixture probability $\boldsymbol{p} \in \Delta\_5$. The idea behind this choice is that three gamma distributions take care of the three modes of the empirical density, see Fig. 13.15, the fourth gamma distribution models the remaining claims in the body of the distribution, and the Lomax distribution takes care of the regularly varying tail of the data. For the gamma distribution, we refer to Sect. 2.1.3, and for the Lomax distribution, we refer to Sect. 2.2.5.

We choose the null model for both the mixture probabilities $\boldsymbol{p} \in \Delta\_5$ and the densities $f\_k$, $1 \le k \le 5$. This model can directly be fitted with the EM algorithm as presented above; in particular, we can estimate the mixture probabilities by (6.31). The remaining shape, scale and tail parameters are directly estimated by MLE. To initialize the EM algorithm we use the interpretation of the components as explained above. We partition the entire data into $K = 5$ bins according to their claim sizes $Y\_i$ being in $(0, 300]$, $(300, 1\,000]$, $(1\,000, 1\,200]$, $(1\,200, 5\,000]$ or $(5\,000, \infty)$. The first three intervals will initialize the three modes of the empirical density, see Fig. 13.15 (lhs). This will correspond to the categorical variable taking values $Z = 1, 2, 3$; the fourth interval will correspond to $Z = 4$ and it will model the main body of the claims; and the last interval will correspond to $Z = 5$, modeling the Lomax tail of the claims. These choices provide the initialization given in Table 6.1 with upper indices $(0)$. We remark that we choose a fixed threshold of $M = 2\,000$ for the Lomax distribution; this choice will be further discussed below.

Based on these choices we run the EM algorithm for mixture distributions. We observe convergence after roughly 80 iterations, and the resulting parameters after 100 iterations are presented in Table 6.1. We observe rather large shape parameters $\widehat{\alpha}\_k^{(100)}$ for the first three components $k = 1, 2, 3$. This indicates that these three components model the three modes of the empirical density, and these three modes collect almost $\widehat{p}\_1^{(100)} + \widehat{p}\_2^{(100)} + \widehat{p}\_3^{(100)} \approx 50\%$ of all claims. The remaining claims are modeled by the gamma density $k = 4$ having mean 1'304 and by the Lomax distribution having tail parameter $\widehat{\beta}\_5^{(100)} = 1.416$; thus, this tail has finite first moment $M/(\widehat{\beta}\_5^{(100)} - 1) = 4\,812$ and infinite second moment.


**Table 6.1** Parameter choices in the mixture model (6.39)

Figure 6.11 shows the resulting estimated mixture distribution. It gives the individual mixture components (top-lhs), the resulting mixture density (top-rhs), the QQ plot (bottom-lhs) and the log-log plot (bottom-rhs). Overall we find a rather good fit; maybe the first mode is a bit too spiky. However, this plot may also be misleading because the empirical density plot relies on kernel smoothing having a given bandwidth. Thus, the true observations may be more spiky than the plot indicates. The third mode suggests that there are two different values in the observations around 1'100, this is also visible in the QQ plot. Nevertheless, the overall result seems satisfactory. These results (based on 13 estimated parameters) are also summarized in Table 6.2.

We mention a couple of limitations of these results. Firstly, the log-likelihood of this mixture model is unbounded; similarly to Example 6.13, we can precisely fit one degenerate gamma mixture component to an individual observation $Y\_i$, which results in an infinite log-likelihood value. Thus, the solution found corresponds to a local maximum of the log-likelihood function, and we should not state AIC values in Table 6.2, see also Remarks 4.28. Secondly, it is crucial to initialize the three components to the three modes: if we randomly allocate all claims to the 5 bins as initial configuration, the EM algorithm only finds mode $Z = 3$ but not necessarily the first two modes; at least, this was the case in our specifically chosen random initialization. In fact, the likelihood value of this latter solution was worse than in the first calibration, which shows that we ended up in a worse local maximum.

We may be tempted to also estimate the Lomax threshold $M$ with MLE. In Fig. 6.12 we plot the maximal log-likelihood as a function of $M$ (if we start the EM algorithm always in the same configuration given in Table 6.1). From this figure a threshold of $M = 1\,600$ seems optimal. Choosing this threshold of $M = 1\,600$ leads to a slightly bigger log-likelihood of −199'304 and a slightly smaller tail parameter of $\widehat{\beta}\_5^{(100)} = 1.318$. However, overall the model is very similar to the one with $M = 2\,000$. In general, we do *not* recommend estimating $M$ with MLE; rather, it should be treated as a hyper-parameter selected by the modeler. The reason for this recommendation is that this threshold is crucial for large claims modeling, and its estimation from data is typically not very robust; we also refer to Remarks 6.15, below.

**Fig. 6.11** Mixture null model: (top-lhs) individual estimated gamma components $f\_k(\cdot; \widehat{\alpha}\_k^{(100)}, \widehat{\beta}\_k^{(100)})$, $1 \le k \le 4$, and Lomax component $f\_5(\cdot; \widehat{\beta}\_5^{(100)})$; (top-rhs) estimated mixture density $\sum\_{k=1}^4 \widehat{p}\_k^{(100)} f\_k(\cdot; \widehat{\alpha}\_k^{(100)}, \widehat{\beta}\_k^{(100)}) + \widehat{p}\_5^{(100)} f\_5(\cdot; \widehat{\beta}\_5^{(100)})$; (bottom-lhs) QQ plot of the estimated model; (bottom-rhs) log-log plot of the estimated model

**Table 6.2** Mixture models for French MTPL claim size modeling


In a next step we enhance the mixture modeling by including the feature information $\boldsymbol{x}\_i$ to explain the responses $Y\_i$. In view of Fig. 13.17 we have decided to model only the mixture probabilities $\boldsymbol{p} = \boldsymbol{p}(\boldsymbol{x})$ feature dependent, because the feature information seems to mainly influence the heights of the peaks. We do not consider the features VehPower and VehGas because these features do not seem to contribute, and we do not consider Density because of its high collinearity with Area, see Fig. 13.12 (rhs). Thus, we are left with the features Area, VehAge, DrivAge, BonusMalus, VehBrand and Region. Pre-processing of these features is done as in Listing 5.1, except that we keep Area categorical. Using these features $\boldsymbol{x} \in \mathcal{X} \subset \{1\} \times \mathbb{R}^q$ we choose a logistic categorical GLM for the mixture probabilities

$$\boldsymbol{x} \mapsto \left( p\_1(\boldsymbol{x}), \ldots, p\_{K-1}(\boldsymbol{x}) \right)^{\top} = \frac{\exp\{\boldsymbol{X}\boldsymbol{\gamma}\}}{1 + \sum\_{l=1}^{4} \exp\{\langle \boldsymbol{\gamma}\_l, \boldsymbol{x}\rangle\}},\tag{6.40}$$

that is, we choose $K = 5$ as reference level, the feature matrix $\boldsymbol{X} \in \mathbb{R}^{(K-1)\times (K-1)(q+1)}$ is defined in (5.71), and the regression parameter is $\boldsymbol{\gamma} = (\boldsymbol{\gamma}\_1^\top, \ldots, \boldsymbol{\gamma}\_{K-1}^\top)^\top \in \mathbb{R}^{(K-1)(q+1)}$; this regression parameter $\boldsymbol{\gamma}$ should not be confused with the scale parameters $\beta\_1, \ldots, \beta\_4$ of the gamma components and the tail parameter $\beta\_5$ of the Lomax component, see (6.39). Note that the notation in this section slightly differs from Sect. 5.7 on the logistic categorical GLM. In this section we consider mixture probabilities $\boldsymbol{p}(\boldsymbol{x}) \in \Delta\_{K=5}$ (which corresponds to one-hot encoding), whereas in Sect. 5.7 we model $(p\_1(\boldsymbol{x}), \ldots, p\_{K-1}(\boldsymbol{x}))^\top$ with a categorical GLM (which corresponds to dummy coding), and the normalization provides us with $p\_K(\boldsymbol{x}) = 1 - \sum\_{l=1}^{K-1} p\_l(\boldsymbol{x}) \in (0,1)$.

This logistic categorical GLM requires that in the M-step we replace the probability estimation (6.31) by Fisher's scoring method for GLMs as outlined in Sect. 5.7.2, but there is a small difference to that section. In the working residuals (5.74) we use the dummy coding $T(Z) \in \{0,1\}^{K-1}$ of a categorical variable $Z$; this now needs to be replaced by the estimated vector $(\widehat{Z}\_1(\boldsymbol{\theta}, \boldsymbol{p}|Y), \ldots, \widehat{Z}\_{K-1}(\boldsymbol{\theta}, \boldsymbol{p}|Y))^\top \in (0,1)^{K-1}$, which is used as an estimate for the latent variable $T(Z)$. Apart from that, everything is done as described in Sect. 5.7.2; in R this can be done with the procedure multinom from the package nnet [368]. We start the EM algorithm exactly in the final configuration of the


**Table 6.3** Parameter choices in the mixture models: upper part null model, lower part GLM for estimated mixture probabilities *<sup>p</sup>(xi)*

estimated mixture null model, and we run this algorithm for 20 iterations (which provides convergence).
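The modified M-step maximizes $\sum\_i \sum\_k \widehat{Z}\_{i,k} \log p\_k(\boldsymbol{x}\_i)$ with the soft labels $\widehat{Z}\_{i,k}$ in place of the dummy coding. As a schematic stand-in for multinom (our own illustration with synthetic data, using plain gradient ascent instead of Fisher's scoring), the following Python sketch shows that this soft-label objective can be increased in the regression parameters of (6.40); all names are ours.

```python
import math, random

def softmax_probs(x, gammas):
    """p_k(x) for k = 1..K with class K as reference, see (6.40):
    the score of class k < K is <gamma_k, x>, the reference score is 0."""
    scores = [sum(g * xi for g, xi in zip(gk, x)) for gk in gammas] + [0.0]
    m = max(scores)
    expo = [math.exp(s - m) for s in scores]
    t = sum(expo)
    return [e / t for e in expo]

def soft_ll(X, Zhat, gammas):
    # M-step objective: sum_i sum_k Zhat_{i,k} log p_k(x_i)
    return sum(sum(z * math.log(pk) for z, pk in zip(zi, softmax_probs(x, gammas)))
               for x, zi in zip(X, Zhat))

random.seed(0)
K, q = 3, 2
X = [[1.0, random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(40)]
Zhat = []
for x in X:                         # synthetic soft labels in the simplex
    w = [math.exp(x[1]), math.exp(-x[1]), 1.0]
    t = sum(w)
    Zhat.append([wk / t for wk in w])

gammas = [[0.0] * (q + 1) for _ in range(K - 1)]
ll0 = soft_ll(X, Zhat, gammas)
for _ in range(300):                # gradient wrt gamma_k: sum_i (Zhat_{i,k} - p_k(x_i)) x_i
    probs = [softmax_probs(x, gammas) for x in X]
    for k in range(K - 1):
        grad = [sum((Zhat[i][k] - probs[i][k]) * X[i][j] for i in range(len(X)))
                for j in range(q + 1)]
        gammas[k] = [g + 0.01 * gd for g, gd in zip(gammas[k], grad)]

assert soft_ll(X, Zhat, gammas) > ll0   # the M-step objective increases
```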

The resulting parameters are given in the lower part of Table 6.3. We observe that the resulting parameters remain essentially the same; the second mode $Z = 2$ is a bit less spiky, and the tail parameter is slightly smaller. The summary of this model is given on the last line of Table 6.2. Regression modeling adds another $4 \cdot 45 = 180$ parameters to the model because we have $q = 45$ feature components in $\boldsymbol{x}$ (different from the intercept component). In view of AIC we give preference to the logistic mixture probability case (though AIC has to be interpreted with care, here, because we do not consider the MLE but rather a local maximum).

Figure 6.13 plots the individual estimated mixture probabilities $\boldsymbol{x}\_i \mapsto \widehat{\boldsymbol{p}}(\boldsymbol{x}\_i) \in \Delta\_5$ over the insurance policies $1 \le i \le n$; these plots are inspired by the thesis of Frei [138]. The upper plots consider these probabilities against the estimated claim sizes $\widehat{\mu}(\boldsymbol{x}\_i) = \sum\_{k=1}^5 \widehat{p}\_k(\boldsymbol{x}\_i)\, \widehat{\mu}\_k$, and the lower plots against the ranks of $\widehat{\mu}(\boldsymbol{x}\_i)$; the latter gives a different scaling on the $x$-axis because of the heavy-tailedness of the claims. The plots on the left-hand side show all individual policies $1 \le i \le n$, and the plots on the right-hand side show a quadratic spline fit to these observations. Not surprisingly, we observe that the claim size estimate $\widehat{\mu}(\boldsymbol{x}\_i)$ is mainly driven by the large claims probability $\widehat{p}\_5(\boldsymbol{x}\_i)$ describing the Lomax contribution.

In Fig. 6.14 we compare the QQ plots of the mixture null model and the one where we model the mixture probabilities with the logistic categorical GLM. We see that the latter (more complex) model clearly outperforms the simpler one; in fact, this QQ plot looks quite convincing for the French MTPL claim size data. Finally, we perform a Wald test (5.32). We simultaneously treat all parameters that belong to the same feature variable (similar to the ANOVA analysis); for instance, for the 22 Regions the corresponding part of the regression parameter $\boldsymbol{\gamma}$ contains $4 \cdot 21 = 84$ components. The resulting $p$-values of dropping such components are all close to 0, which says that we should not eliminate any of the feature variables. This closes the example.

**Fig. 6.13** Mixture probabilities $\boldsymbol{x}\_i \mapsto \widehat{\boldsymbol{p}}(\boldsymbol{x}\_i)$ on individual policies $1 \le i \le n$: (top) against the estimated means $\widehat{\mu}(\boldsymbol{x}\_i)$ and (bottom) against the ranks of the estimated means $\widehat{\mu}(\boldsymbol{x}\_i)$; (lhs) over policies $1 \le i \le n$ and (rhs) quadratic spline fit

#### *Remarks 6.15*

• In Example 6.14 we have chosen a mixture distribution with four gamma components and one Lomax component. The reason for choosing the Lomax component has been two-fold. Firstly, we need a regularly varying tail to model the heavy-tailed property of the data. Secondly, we have preferred the Lomax distribution over the Pareto distribution because this provides us with a continuous density in (6.39). The results in Example 6.14 have been satisfactory. In many practical applications, however, this approach will not work, even when fixing the threshold $M$ of the Lomax component. Often, the nature of the data is such that the chosen gamma mixture distribution is not able to fully explain the small claims in the body of the distribution, and in that situation the Lomax tail will assist in fitting the small claims. The typical result is that the Lomax part

**Fig. 6.14** QQ plots of the mixture models: (lhs) null model and (rhs) logistic categorical GLM for mixture probabilities

then pays more attention to the small claims (through the log-likelihood contributions of the numerous small claims), and the fit of the tail turns out to be poor (because the few large claims do not contribute sufficiently to the log-likelihood). There are two ways to resolve this dilemma. Either one works with composite distributions, see (6.56) below, and drops the continuity property of the density; this is the approach taken in Fung et al. [148]. Or one fits the Lomax distribution solely to large observations in a first step, and then fixes the parameters of the Lomax distribution in the second step when fitting the full model to all data; this is the approach taken in Frei [138]. Both approaches have provided good results on real insurance data.


## **6.4 Truncated and Censored Data**

## *6.4.1 Lower-Truncation and Right-Censoring*

A common problem in insurance is that we often have truncated or censored observations. Truncation naturally occurs if we sell insurance products that have a deductible *d >* 0 because in that case only the insurance claim *(Y* − *d)*<sup>+</sup> is compensated, and claims below the deductible *d* are usually not reported to the insurance company. This case is called *lower-truncation*, because claims below the deductible are not observed. If we lower-truncate an original claim *Y* ∼ *f (*·; *θ )* with lower-truncation point *<sup>τ</sup>* <sup>∈</sup> <sup>R</sup> we obtain the density

$$f\_{(\tau,\infty)}(y;\theta) = \frac{f(y;\theta)\mathbb{1}\_{\{y > \tau\}}}{1 - F(\tau;\theta)},\tag{6.41}$$

where $F(\cdot; \theta)$ is the distribution function corresponding to the density $f(\cdot; \theta)$. The lower-truncated density $f\_{(\tau,\infty)}(y; \theta)$ only considers claims that fall into the interval $(\tau, \infty)$. Obviously, we can define upper-truncation completely analogously by considering an interval $(-\infty, \tau]$ instead. Figure 6.15 (lhs) gives an example of a lower-truncated density, and Fig. 6.15 (rhs) gives an example of a lower- and upper-truncated density.
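As a quick numerical sanity check of (6.41) (our own illustration, not from the book), the following Python sketch builds the lower-truncated density of an exponential distribution, for which $F$ is available in closed form, and verifies that it integrates to 1 over $(\tau, \infty)$.

```python
import math

def f(y, lam=1.0):
    """Exponential density on (0, inf)."""
    return lam * math.exp(-lam * y)

def F(y, lam=1.0):
    """Exponential distribution function."""
    return 1.0 - math.exp(-lam * y)

def f_trunc(y, tau, lam=1.0):
    """Lower-truncated density (6.41)."""
    return f(y, lam) / (1.0 - F(tau, lam)) if y > tau else 0.0

tau = 2.0
# midpoint-rule integration of the truncated density over (tau, 40); the
# tail mass beyond 40 is negligible for lam = 1
step, upper = 1e-3, 40.0
integral = sum(f_trunc(tau + (i + 0.5) * step, tau) * step
               for i in range(int((upper - tau) / step)))
assert abs(integral - 1.0) < 1e-3     # the truncated density is normalized
```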

Censoring occurs by selling insurance products with a maximal cover *M >* 0 because in that case only the insurance claim *Y* ∧ *M* = min{*Y,M*} is compensated, and the exact claim size above the maximal cover *M* may not be available. This case is called *right-censoring* because the exact claim amount above *M* is not known. Right-censoring of an original claim *<sup>Y</sup>* <sup>∼</sup> *F (*·; *θ )* with censoring point *<sup>M</sup>* <sup>∈</sup> <sup>R</sup>

**Fig. 6.15** (lhs) Lower-truncated gamma density with *τ* = 2 000, and (rhs) lower- and uppertruncated gamma density with truncation points 2 000 and 6 000

**Fig. 6.16** (lhs) Right-censored gamma distribution with *M* = 6 000, and (rhs) left- and rightcensored gamma distribution with censoring points 2 000 and 6 000

gives the distribution

$$F\_{Y \wedge M}(\mathbf{y}; \theta) = F(\mathbf{y}; \theta) \mathbb{1}\_{\{\mathbf{y} < M\}} + \mathbb{1}\_{\{\mathbf{y} \ge M\}},$$

that is, we have a point mass in the censoring point *M*. We can define left-censoring analogously by considering the claim *Y* ∨*M* = max{*Y,M*}. Figure 6.16 (lhs) shows a right-censored gamma distribution with censoring point *M* = 6 000, and Fig. 6.16 (rhs) shows a left- and right-censored example with censoring points 2 000 and 6 000.

Often in re-insurance, deductibles (also called retention levels) and maximal covers are combined, for instance, an excess-of-loss (XL) insurance cover of size *u >* 0 above the retention level *d >* 0 covers the claim

$$(Y - d)\_+ \wedge u = (Y - d)\mathbb{1}\_{\{d \le Y < d + u\}} + u \mathbb{1}\_{\{Y \ge d + u\}} = (Y - d)\_+ - (Y - (d + u))\_+ \,.$$
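The last identity for the XL payout can be verified directly; a minimal Python check (our own illustration, with hypothetical function names):

```python
def xl_payout(y, d, u):
    """Excess-of-loss payout (Y - d)_+ ∧ u."""
    return min(max(y - d, 0.0), u)

def layer_difference(y, d, u):
    """Equivalent layer representation (Y - d)_+ - (Y - (d + u))_+."""
    return max(y - d, 0.0) - max(y - (d + u), 0.0)

d, u = 500.0, 2000.0
for y in [0.0, 499.0, 500.0, 1200.0, 2500.0, 2600.0, 10000.0]:
    assert xl_payout(y, d, u) == layer_difference(y, d, u)
```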

Obviously, truncation and censoring pose some challenges in regression modeling because at the same time we need to consider the density *f (*·; *θ )* and the distribution function *F (*·; *θ )* to estimate a parameter *θ*. Both cases can be understood as missing data problems, with censoring providing the number of claims but not necessarily the exact claim size, and with truncation leaving also the number of claims unknown. These two cases are studied in Fung et al. [147] within the mixture of experts models using a variant of the EM algorithm. We use their techniques within the EDF framework for right-censored or lower-truncated data. This is done in the next sections.

## *6.4.2 Parameter Estimation Under Right-Censoring*

Assume we have a fixed censoring point $M > 0$ that applies to independent observations $Y\_i$ following the EDF densities $f(\cdot; \theta\_i, v\_i/\varphi)$; for simplicity, we assume to work with an absolutely continuous EDF in this section. The (incomplete) log-likelihood function of the canonical parameters $\boldsymbol{\theta} = (\theta\_i)\_{1\le i\le n}$ for the observations $\boldsymbol{Y} \wedge M$ is given by

$$\ell\_{\boldsymbol{Y} \wedge M}(\boldsymbol{\theta}) = \sum\_{i:\, Y\_i < M} \log f(Y\_i; \theta\_i, v\_i/\varphi) + \sum\_{i:\, Y\_i \ge M} \log \left( 1 - F(M; \theta\_i, v\_i/\varphi) \right). \tag{6.42}$$

We interpret this as an incomplete data problem because the claim sizes *Yi* above the censoring point *M* are not known. The complete log-likelihood is given by

$$\ell\_Y(\theta) = \sum\_{i=1}^n \log f(Y\_i; \theta\_i, v\_i/\varphi).$$

Similarly to (6.32) we calculate a lower bound to the incomplete log-likelihood. We focus on one component of $\boldsymbol{Y}$ and drop the lower index $i$ in $Y\_i$ for this consideration. Firstly, if $Y \wedge M < M$ we are in the situation of full claim size information $Y < M$ and, obviously, in that case we have the log-likelihood

$$\ell\_{Y \wedge M}(\theta) = \ell\_Y(\theta) = \frac{Y\theta - \kappa(\theta)}{\varphi/v} + a(Y; v/\varphi). \tag{6.43}$$

In the second case $Y \wedge M = M$ we do not have precise claim size information. In that case, the claim $Y|\_{\{Y \wedge M = M\}} = Y|\_{\{Y \ge M\}}$ above $M$ has the conditional density

$$f(z|Y \ge M; \theta, v/\varphi) = \frac{f(z; \theta, v/\varphi)\mathbb{1}\_{\{z \ge M\}}}{1 - F(M; \theta, v/\varphi)} = \frac{f(z; \theta, v/\varphi)\mathbb{1}\_{\{z \ge M\}}}{\exp\{\ell\_{Y\wedge M}(\theta)\}},\qquad(6.44)$$

the latter follows because *Y* ∧*M* = *M* has the corresponding point mass in censoring point *M* (we work with an absolutely continuous EDF here). Choose an arbitrary density *π* having the same support as *Y* |{*Y*≥*M*}, and consider a random variable *Z* ∼ *π*. Using (6.44) and the EDF structure on the last line, we have for *Y* ≥ *M*

$$\begin{aligned} \ell\_{Y \wedge M}(\theta) &= \int \pi(z) \, \ell\_{Y \wedge M}(\theta) \, d\nu(z) \\ &= \int \pi(z) \log \left( \frac{f(z; \theta, v/\varphi)/\pi(z)}{f(z|Y \ge M; \theta, v/\varphi)/\pi(z)} \right) d\nu(z) \\ &= \int \pi(z) \log \left( \frac{f(z; \theta, v/\varphi)}{\pi(z)} \right) d\nu(z) + D\_{\text{KL}}\left(\pi||f(\cdot|Y \ge M; \theta, v/\varphi)\right) \end{aligned}$$

$$\begin{split} &\ge \int \pi(z) \log \left( \frac{f(z;\theta,v/\varphi)}{\pi(z)} \right) d\nu(z) \\ &= \frac{\mathbb{E}\_{\pi} \left[ Z \right] \theta - \kappa(\theta)}{\varphi/v} + \mathbb{E}\_{\pi} \left[ a(Z; v/\varphi) \right] - \mathbb{E}\_{\pi} \left[ \log \pi(Z) \right] \stackrel{\text{def.}}{=} \mathcal{Q}(\theta;\pi). \end{split}$$

This allows us to explore the E-step and the M-step similarly to (6.34) and (6.35).

The **E-step** in the case $Y \ge M$ for a given canonical parameter estimate $\widehat{\theta}^{(t-1)}$ reads as

$$\begin{split} \widehat{\pi}^{(t)} &= \operatorname\*{arg\,max}\_{\pi} \mathcal{Q}\left(\widehat{\theta}^{(t-1)}; \pi\right) \\ &= \operatorname\*{arg\,min}\_{\pi} D\_{\text{KL}}\left(\pi \,\middle\|\, f(\cdot|Y \ge M; \widehat{\theta}^{(t-1)}, v/\varphi) \right) \\ &= f(\cdot|Y \ge M; \widehat{\theta}^{(t-1)}, v/\varphi). \end{split}$$

This allows us to calculate an estimate of the claim size above $M$, i.e., under $\widehat{\pi}^{(t)}$

$$\widehat{Y}^{(t)} = \mathbb{E}\_{\widehat{\pi}^{(t)}}\left[Z\right] = \int z \, f(z|Y \ge M; \widehat{\theta}^{(t-1)}, v/\varphi) \, d\nu(z). \tag{6.45}$$

Note that this is an estimate of the censored claim *Y* |{*Y*≥*M*}. This completes the E-step.

The **M-step** considers in the EDF case for censored claim sizes *Y* ≥ *M*

$$\widehat{\theta}^{(t)} = \operatorname\*{arg\,max}\_{\theta} \mathcal{Q} \left( \theta; \widehat{\pi}^{(t)} \right) = \operatorname\*{arg\,max}\_{\theta} \ell\_{\widehat{Y}^{(t)}}(\theta), \tag{6.46}$$

the latter uses that the normalizing term $a(\cdot; v/\varphi)$ is not relevant for the MLE of $\theta$. That is, (6.46) describes the regular MLE step under the observation $\widehat{Y}^{(t)}$ in the case of a censored observation $Y \ge M$; if $Y < M$ we simply use the log-likelihood (6.43).

#### EM algorithm for right-censored data within the EDF

- **E-step.** Given the parameter estimate $\widehat{\boldsymbol{\theta}}^{(t-1)} = (\widehat{\theta}\_i^{(t-1)})\_{1 \le i \le n}$, estimate for the right-censored claims $Y\_i \ge M$ their sizes by, see (6.45),

$$
\widehat{Y}\_i^{(t)} = \int z \, f\left(z \,\middle|\, Y\_i \ge M; \widehat{\theta}\_i^{(t-1)}, v\_i/\varphi \right) d\nu(z).
$$

This provides us with an estimated observation

$$\widehat{\boldsymbol{Y}}^{(t)} = \left( Y\_i \mathbb{1}\_{\{Y\_i < M\}} + \widehat{Y}\_i^{(t)} \mathbb{1}\_{\{Y\_i \ge M\}} \right)\_{1 \le i \le n}.$$

• **M-step.** Calculate the MLE $\widehat{\boldsymbol{\theta}}^{(t)} = (\widehat{\theta}\_i^{(t)})\_{1 \le i \le n}$ based on the observation $\widehat{\boldsymbol{Y}}^{(t)}$, i.e., solve

$$
\widehat{\boldsymbol{\theta}}^{(t)} = \underset{\boldsymbol{\theta}}{\text{arg}\max} \,\ell\_{\widehat{Y}^{(t)}}(\boldsymbol{\theta}).
$$

Note that the above EM algorithm uses that the log-likelihood $\ell\_{Y}(\theta)$ of the EDF is linear in the observations that interact with the parameter $\theta$. We revisit the gamma claim size example of Sect. 5.3.7.
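As a minimal numeric illustration of this EM loop, the following Python sketch (not from the text; the book itself works in R) runs the E-step and M-step for an intercept-only exponential model, i.e., a gamma model with shape parameter 1, where the E-step (6.45) reduces to the elementary closed form $\mathbb{E}[Y|Y\ge M] = M + \mu$ by memorylessness. All parameter values and names are illustrative assumptions.

```python
import math
import random

# EM for right-censored data, intercept-only exponential special case.
# E-step: replace each censored observation by E[Y | Y >= M] = M + mu.
# M-step: the exponential MLE of the mean is the sample mean of the
# completed observations.
random.seed(1)
mu_true, M, n = 2.0, 3.0, 50_000
Y_full = [random.expovariate(1 / mu_true) for _ in range(n)]
Y_cens = [min(y, M) for y in Y_full]      # observed right-censored data

mu = sum(Y_cens) / n                      # crude initial fit (biased low)
for _ in range(30):
    Y_hat = [y if y < M else M + mu for y in Y_cens]   # E-step
    mu = sum(Y_hat) / n                                # M-step

print(round(mu, 2))  # close to mu_true = 2.0
```

The fixed point of this iteration is the classical censored-data MLE for the exponential mean, the sum of the censored observations divided by the number of uncensored ones.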

*Example 6.16 (Right-Censored Gamma Claim Sizes)* We revisit the gamma claim size GLM introduced in Sect. 5.3.7. The claim sizes are illustrated in Fig. 13.22. In total we have *n* = 656 observations $Y\_i$, and they range from 16 SEK to 211'254 SEK. We right-censor this data at *M* = 50 000, which results in 545 uncensored observations and 111 censored observations equal to *M*. Thus, for the 17% largest claims we assume that we have no knowledge of the exact claim sizes. We use the EM algorithm for right-censored data to fit a GLM to this problem.

In order to calculate the E-step we need to evaluate the conditional expectation (6.45) under the gamma model

$$
\widehat{Y}^{(t)} = \int z \, f(z|Y \ge M; \widehat{\theta}^{(t-1)}, v/\varphi) \, d\nu(z) \tag{6.47}
$$

$$
= \int\_M^\infty z \frac{\frac{\beta^\alpha}{\Gamma(\alpha)} z^{\alpha - 1} \exp\{-\beta z\}}{1 - \mathcal{G}(\alpha, \beta M)} \, dz = \frac{\alpha}{\beta} \frac{1 - \mathcal{G}(\alpha + 1, \beta M)}{1 - \mathcal{G}(\alpha, \beta M)},
$$

with shape parameter $\alpha = v/\varphi$, scale parameter $\beta = -\widehat{\theta}^{(t-1)} v/\varphi$, see (5.45), and scaled incomplete gamma function

$$\mathcal{G}(\alpha, y) = \frac{1}{\Gamma(\alpha)} \int\_0^y z^{\alpha - 1} \exp\{-z\} \, dz \in (0, 1) \qquad \text{for } y \in (0, \infty). \tag{6.48}$$

Thus, we receive a simple formula that allows us to efficiently calculate the E-step, and the M-step is exactly the gamma GLM explained in Sect. 5.3.7 for the (estimated) data $\widehat{\boldsymbol{Y}}^{(t)}$.
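The closed form just derived can be verified numerically. The following Python sketch (illustrative only; the parameter values are assumptions, not the values from the example) evaluates the regularized incomplete gamma function (6.48) by simple numerical integration and compares the closed-form conditional mean with a direct numerical evaluation of $\mathbb{E}[Z|Z\ge M]$.

```python
import math

def G(a, y, steps=100_000):
    """Regularized lower incomplete gamma function (6.48), trapezoidal rule."""
    h = y / steps
    f = lambda z: z ** (a - 1) * math.exp(-z)
    s = 0.5 * (f(1e-12) + f(y)) + sum(f(i * h) for i in range(1, steps))
    return h * s / math.gamma(a)

# illustrative shape, rate and censoring point (chosen so the tail matters)
alpha, beta, M = 1.427, 0.0005, 3000.0

closed_form = (alpha / beta) * (1 - G(alpha + 1, beta * M)) / (1 - G(alpha, beta * M))

# direct midpoint-rule evaluation of E[Z | Z >= M] on a finite grid
dens = lambda z: beta ** alpha / math.gamma(alpha) * z ** (alpha - 1) * math.exp(-beta * z)
steps, upper = 200_000, 40 / beta
h = (upper - M) / steps
num = den = 0.0
for i in range(steps):
    z = M + (i + 0.5) * h
    fz = dens(z)
    num += z * fz
    den += fz
direct = num / den

print(closed_form, direct)  # the two values agree closely
```

In production one would of course use a library implementation of the incomplete gamma function instead of the hand-rolled quadrature above.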

For the modeling we choose exactly the features used for model Gamma GLM2; this gives $q+1 = 7$ regression parameter components, and additionally we set the dispersion parameter to $\widehat{\varphi}^{\text{MLE}} = 1.427$, which is the MLE in model Gamma GLM2. This dispersion parameter is kept fixed in all models studied in this example. In a first step we simply fit a gamma GLM to the right-censored data $Y\_i \wedge M$. We call this model 'crude GLM2'; it underestimates the empirical claim sizes by 28% because it ignores the fact of having right-censored data.

**Table 6.4** Comparison of the complete log-likelihood and the incomplete log-likelihood (right-censoring *M* = 50 000) results

To initialize the EM algorithm for right-censored data we use the model crude GLM2. We then iterate the algorithm for 15 steps, which provides convergence. The results are presented in Table 6.4. We observe that the resulting log-likelihood of the model fitted on the censored data and evaluated on the complete data $\boldsymbol{Y}$ (which is available here) is almost the same as for model Gamma GLM2, which has been estimated on the complete data. Moreover, this right-censored EM algorithm fitted model slightly over-estimates the average claim sizes.

Figure 6.17 shows the estimated means *μi* on an individual claims level. The *x*-axis always gives the estimates from the complete log-likelihood model Gamma GLM2. The *y*-axis on the left-hand side shows the estimates from the crude GLM and the right-hand side the estimates from the EM algorithm fitted counterpart (fitted on the right-censored data). We observe that the crude model underestimates the claims (being below the diagonal), and the largest estimate lies below *M* = 50 000

**Fig. 6.17** Comparison of the estimated means $\mu\_i$ in model Gamma GLM2 against (lhs) the crude GLM and (rhs) the EM fitted right-censored model; both axes are on the log-scale, and the dotted line shows the censoring point $\log(M)$

in our example (horizontal dotted line). The EM algorithm fitted model, which accounts for the right-censoring, corrects for it, and the resulting estimates resemble the ones from the complete log-likelihood model quite well. In fact, we probably slightly over-estimate under right-censoring here. Note that all these considerations have been done under an identical dispersion parameter estimate $\widehat{\varphi}^{\text{MLE}}$. In the complete log-likelihood case this is not really needed for mean estimation, because the dispersion cancels in the score equations. However, a reasonable dispersion parameter estimate is crucial in the incomplete case, as it enters $\widehat{Y}^{(t)}$ in the E-step, see (6.47). The caveat here is that we need a reasonable dispersion estimate from the right-censored data, which we did not discuss here and which requires further research.

## *6.4.3 Parameter Estimation Under Lower-Truncation*

Compared to censoring, we have less information under truncation because not only are the claim sizes below the lower-truncation point unknown, but we also do not know how many claims there are below that truncation point $\tau$. Assume we work with responses belonging to the EDF. The incomplete log-likelihood is given by

$$\ell\_{Y>\tau}(\boldsymbol{\theta}) = \sum\_{i=1}^{n} \log f(Y\_i; \theta\_i, v\_i/\varphi) - \log \left( 1 - F(\tau; \theta\_i, v\_i/\varphi) \right),$$

assuming that $\boldsymbol{Y} = (Y\_i)\_{1\le i\le n}$ collects all claims above the truncation point, $Y\_i > \tau$, see (6.41). We proceed as in Fung et al. [147] to construct a complete log-likelihood; there are different ways to do so, but this proposal is convenient for parameter estimation. Firstly, we equip each observed claim $Y\_i > \tau$ with an independent count random variable $K\_i \sim p(\cdot; \theta\_i, v\_i/\varphi)$ that determines the number of claims below the truncation point that correspond to claim $i$ above the truncation point. Secondly, we assume that these claims are given by independent observations $Z\_{i,1}, \ldots, Z\_{i,K\_i} \le \tau$, a.s., with a distribution obtained from an un-truncated version of $Y\_i$, i.e., we consider the upper-truncated version of $f(\cdot; \theta\_i, v\_i/\varphi)$ for the $Z\_{i,j}$. This gives us the complete log-likelihood

$$\ell\_{(\boldsymbol{Y},\boldsymbol{K},\boldsymbol{Z})}(\boldsymbol{\theta}) = \sum\_{i=1}^{n} \left( \log \left( \frac{f(Y\_i; \theta\_i, v\_i/\varphi)}{1 - F(\tau; \theta\_i, v\_i/\varphi)} \right) + \log p(K\_i; \theta\_i, v\_i/\varphi) + \sum\_{j=1}^{K\_i} \log \left( \frac{f(Z\_{i,j}; \theta\_i, v\_i/\varphi)}{F(\tau; \theta\_i, v\_i/\varphi)} \right) \right), \tag{6.49}$$

with $\boldsymbol{K} = (K\_i)\_{1\le i\le n}$, and $\boldsymbol{Z}$ collects all (latent) claims $Z\_{i,j} \le \tau$; an empty sum is set equal to zero. Next, we assume that $K\_i$ follows the geometric distribution

$$\mathbb{P}\_{\theta\_i}[K\_i = k] = p(k; \theta\_i, v\_i/\varphi) = F(\tau; \theta\_i, v\_i/\varphi)^k \left(1 - F(\tau; \theta\_i, v\_i/\varphi)\right). \tag{6.50}$$

As emphasized in Fung et al. [147], this complete log-likelihood is an artificial construct that supports parameter estimation for lower-truncated data. It does *not* claim that the true un-truncated data follows this model (6.49), but it provides a distributional extension below the truncation point $\tau > 0$ that is convenient for parameter estimation. Namely, inserting this geometric distribution assumption into (6.49) gives us the complete log-likelihood

$$\ell\_{(\boldsymbol{Y},\boldsymbol{K},\boldsymbol{Z})}(\boldsymbol{\theta}) = \sum\_{i=1}^{n} \left( \log f(Y\_i; \theta\_i, v\_i/\varphi) + \sum\_{j=1}^{K\_i} \log f(Z\_{i,j}; \theta\_i, v\_i/\varphi) \right). \tag{6.51}$$

Within the EDF this allows us to apply the same EM algorithm considerations as above; note that this expression no longer involves the distribution function. We consider one observation $Y\_i > \tau$ and drop the lower index $i$. This gives us the complete observation $(Y, K, \boldsymbol{Z} = (Z\_j)\_{1\le j\le K})$ and conditional density

$$f(k, \boldsymbol{z} | y; \theta, v/\varphi) = \frac{f(y, k, \boldsymbol{z}; \theta, v/\varphi)}{f\_{(\tau, \infty)}(y; \theta, v/\varphi)} = \frac{f(y, k, \boldsymbol{z}; \theta, v/\varphi)}{\exp\{\ell\_{Y = y > \tau}(\theta)\}},$$

where $\ell\_{Y>\tau}(\theta)$ is the log-likelihood of the lower-truncated datum $Y > \tau$. Choose an arbitrary density $\pi$ modeling the random vector $(K, \boldsymbol{Z})$ below the truncation point $\tau$. This gives us for the random vector $(K, \boldsymbol{Z}) \sim \pi$

$$\begin{split} \ell\_{Y>\tau}(\theta) &= \int \pi(k,\boldsymbol{z}) \, \ell\_{Y>\tau}(\theta) \, d\nu(k,\boldsymbol{z}) \\ &= \int \pi(k,\boldsymbol{z}) \log \left( \frac{f(Y,k,\boldsymbol{z};\theta,v/\varphi)/\pi(k,\boldsymbol{z})}{f(k,\boldsymbol{z}|Y;\theta,v/\varphi)/\pi(k,\boldsymbol{z})} \right) d\nu(k,\boldsymbol{z}) \\ &= \int \pi(k,\boldsymbol{z}) \log \left( \frac{f(Y,k,\boldsymbol{z};\theta,v/\varphi)}{\pi(k,\boldsymbol{z})} \right) d\nu(k,\boldsymbol{z}) + D\_{\text{KL}}\left(\pi\,\|\,f(\cdot|Y;\theta,v/\varphi)\right) \\ &\geq \int \pi(k,\boldsymbol{z}) \log \left( \frac{f(Y,k,\boldsymbol{z};\theta,v/\varphi)}{\pi(k,\boldsymbol{z})} \right) d\nu(k,\boldsymbol{z}) \\ &= \mathbb{E}\_{\pi} \left[ \ell\_{(Y,K,\boldsymbol{Z})}(\theta) \,\big|\, Y \right] - \mathbb{E}\_{\pi} \left[ \log \pi(K,\boldsymbol{Z}) \right] \\ &= \log f(Y;\theta,v/\varphi) + \mathbb{E}\_{\pi} \left[ \sum\_{j=1}^{K} \log f(Z\_j;\theta,v/\varphi) \right] - \mathbb{E}\_{\pi} \left[ \log \pi(K,\boldsymbol{Z}) \right] \\ &\overset{\text{def.}}{=} \mathcal{Q}(\theta;\pi), \end{split}$$


where the second last identity uses that the log-likelihood (6.51) has a simple form under the geometric distribution chosen for $K$; this is exactly the step where we benefit from this specific choice of the probability extension below the truncation point. There is a subtle point here: $\ell\_{Y>\tau}(\theta)$ is the log-likelihood of the lower-truncated datum $Y > \tau$, whereas $\log f(Y; \theta, v/\varphi)$ is the log-likelihood not using any lower-truncation.

The **E-step** for the given canonical parameter estimate $\widehat{\theta}^{(t-1)}$ reads as

$$\begin{split} \widehat{\pi}^{(t)} &= \operatorname\*{arg\,max}\_{\pi} \mathcal{Q}\left(\widehat{\theta}^{(t-1)}; \pi\right) \\ &= \operatorname\*{arg\,min}\_{\pi} D\_{\text{KL}}\left(\pi \,\middle\|\, f(\cdot|Y; \widehat{\theta}^{(t-1)}, v/\varphi) \right) \\ &= f\left(\cdot \,\middle|\, Y; \widehat{\theta}^{(t-1)}, v/\varphi \right), \end{split}$$

that is, for $(k, \boldsymbol{z})$,

$$\widehat{\pi}^{(t)}(k, \boldsymbol{z}) = p\left(k; \widehat{\theta}^{(t-1)}, v/\varphi \right) \prod\_{j=1}^{k} \frac{f(z\_j; \widehat{\theta}^{(t-1)}, v/\varphi)}{F(\tau; \widehat{\theta}^{(t-1)}, v/\varphi)}.$$

The latter describes a compound distribution for $\sum\_{j=1}^{K} Z\_j$ with a geometric count random variable $K$ and i.i.d. random variables $Z\_1, Z\_2, \ldots$, having upper-truncated densities $f\_{(-\infty,\tau]}(\cdot; \widehat{\theta}^{(t-1)}, v/\varphi)$. This allows us to calculate the expected compound claim below the truncation point

$$\begin{split} \widehat{Y}\_{\leq \tau}^{(t)} &= \mathbb{E}\_{\widehat{\pi}^{(t)}} \left[ \sum\_{j=1}^{K} Z\_{j} \right] \\ &= \frac{F(\tau; \widehat{\theta}^{(t-1)}, v/\varphi)}{1 - F(\tau; \widehat{\theta}^{(t-1)}, v/\varphi)} \int z \, f\_{( - \infty, \tau]}(z; \widehat{\theta}^{(t-1)}, v/\varphi) \, d\nu(z). \end{split}$$

This completes the E-step.

The **M-step** considers within the EDF

$$\begin{split} \widehat{\theta}^{(t)} &= \operatorname\*{arg\,max}\_{\theta} \mathcal{Q}\left(\theta; \widehat{\pi}^{(t)}\right) \\ &= \operatorname\*{arg\,max}\_{\theta} \frac{\left(Y + \mathbb{E}\_{\widehat{\pi}^{(t)}} \left[\sum\_{j=1}^{K} Z\_{j}\right]\right) \theta - \left(1 + \mathbb{E}\_{\widehat{\pi}^{(t)}} \left[K\right] \right) \kappa(\theta)}{\varphi/\upsilon} \\ &= \operatorname\*{arg\,max}\_{\theta} \frac{\upsilon(1 + \mathbb{E}\_{\widehat{\pi}^{(t)}} \left[K\right])}{\varphi} \left[\left(\frac{Y + \widehat{Y}\_{\leq \mathsf{T}}^{(t)}}{1 + \mathbb{E}\_{\widehat{\pi}^{(t)}} \left[K\right]}\right) \theta - \kappa(\theta)\right]. \end{split}$$

That is, the M-step applies the classical MLE step; we only need to change the weights and observations

$$v \mapsto v^{(t)} = v\left(1 + \mathbb{E}\_{\widehat{\pi}^{(t)}}\left[K\right]\right) \\ = \frac{v}{1 - F(\tau; \widehat{\theta}^{(t-1)}, v/\varphi)},$$

$$Y \mapsto \widehat{Y}^{(t)} = \frac{Y + \widehat{Y}\_{\leq \tau}^{(t)}}{1 + \mathbb{E}\_{\widehat{\pi}^{(t)}}\left[K\right]} \\ = \frac{Y + \mathbb{E}\_{\widehat{\pi}^{(t)}}\left[K\right]\mathbb{E}\_{\widehat{\pi}^{(t)}}\left[Z\_{1}\right]}{1 + \mathbb{E}\_{\widehat{\pi}^{(t)}}\left[K\right]}.$$

Note that this uses the specific structure of the EDF, in particular, we benefit from linearity here which allows for closed-form solutions.

#### EM algorithm for lower-truncated data within the EDF

- **E-step.** Given the parameter estimate $\widehat{\boldsymbol{\theta}}^{(t-1)} = (\widehat{\theta}\_i^{(t-1)})\_{1 \le i \le n}$, estimate the number of claims $K\_i$ and the corresponding claim sizes $Z\_{i,j}$ by

$$
\widehat{K}\_i^{(t)} = \frac{F(\tau; \widehat{\theta}\_i^{(t-1)}, v\_i/\varphi)}{1 - F(\tau; \widehat{\theta}\_i^{(t-1)}, v\_i/\varphi)},
$$

$$
\widehat{Z}\_{i,1}^{(t)} = \int z \, f\_{(-\infty, \tau]}(z; \widehat{\theta}\_i^{(t-1)}, v\_i/\varphi) \, d\nu(z). \tag{6.52}
$$

This provides us with estimated weights and observations for 1 ≤ *i* ≤ *n*

$$v\_i^{(t)} = v\_i \left(1 + \widehat{K}\_i^{(t)}\right) \qquad \text{and} \qquad \widehat{Y}\_i^{(t)} = \frac{Y\_i + \widehat{K}\_i^{(t)} \widehat{Z}\_{i,1}^{(t)}}{1 + \widehat{K}\_i^{(t)}}.$$

• **M-step.** Calculate the MLE $\widehat{\boldsymbol{\theta}}^{(t)} = (\widehat{\theta}\_i^{(t)})\_{1 \le i \le n}$ based on the observations $\widehat{\boldsymbol{Y}}^{(t)} = (\widehat{Y}\_i^{(t)})\_{1 \le i \le n}$ and weights $\boldsymbol{v}^{(t)} = (v\_i^{(t)})\_{1 \le i \le n}$, i.e., solve

$$\widehat{\boldsymbol{\theta}}^{(t)} = \mathop{\arg\max}\_{\boldsymbol{\theta}} \ell\_{\widehat{Y}^{(t)}}(\boldsymbol{\theta}; \boldsymbol{v}^{(t)}/\varphi) = \mathop{\arg\max}\_{\boldsymbol{\theta}} \sum\_{i=1}^{n} \log f(\widehat{Y}\_{i}^{(t)}; \theta\_{i}, v\_{i}^{(t)}/\varphi).$$

*Remarks 6.17* Essentially, the above algorithm uses that the MLE in the EDF is based on a sufficient statistic of the observations, and in our case this sufficient statistic is $\widehat{Y}\_i^{(t)}$.
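A minimal numeric illustration of this lower-truncated EM loop, as a sketch under illustrative assumptions: we use an intercept-only exponential model (gamma shape 1), so that the truncation probability and (6.52) have elementary closed forms.

```python
import math
import random

# EM for lower-truncated data, intercept-only exponential special case.
# Closed forms for the exponential with mean mu:
#   F(tau)  = 1 - exp(-tau/mu),
#   Z1_hat  = E[Z | Z <= tau] = mu - tau*exp(-tau/mu) / (1 - exp(-tau/mu)).
random.seed(7)
mu_true, tau, n_full = 2.0, 1.0, 200_000
sample = [random.expovariate(1 / mu_true) for _ in range(n_full)]
Y = [y for y in sample if y > tau]        # only claims above tau are observed
y_bar = sum(Y) / len(Y)

mu = y_bar                                # crude initial fit (biased high)
for _ in range(100):
    F = 1 - math.exp(-tau / mu)           # truncation probability
    K_hat = F / (1 - F)                   # E-step: expected latent claim count
    Z1_hat = mu - tau * math.exp(-tau / mu) / F   # E-step: E[Z | Z <= tau]
    mu = (y_bar + K_hat * Z1_hat) / (1 + K_hat)   # M-step: weighted mean

print(round(mu, 2))  # close to mu_true = 2.0
```

For the exponential the fixed point can be computed by hand: it is $\bar{y} - \tau$, the classical MLE for lower-truncated exponential data, which the loop reproduces.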

*Example 6.18 (Lower-Truncated Claim Sizes)* We revisit the gamma claim size GLM introduced in Sect. 5.3.7, see also Example 6.16 on right-censored claims. We choose the lower-truncation point *τ* = 1 000, i.e., we remove the very small claims that mainly generate administrative expenses at a rather small claim compensation. We have 70 claims below this truncation point, and there remain *n* = 586 claims above the truncation point that can be used for model fitting in the lower-truncated case. We use the EM algorithm for lower-truncated data to fit a GLM to this problem.

In order to calculate the E-step we need to evaluate the conditional expectation (6.52) under the gamma model for truncation probability

$$F(\tau; \widehat{\theta}^{(t-1)}, v/\varphi) = \int\_0^{\tau} \frac{\beta^{\alpha}}{\Gamma(\alpha)} z^{\alpha - 1} \exp\{-\beta z\} \, dz = \mathcal{G}(\alpha, \beta \tau),$$

with shape parameter $\alpha = v/\varphi$ and scale parameter $\beta = -\widehat{\theta}^{(t-1)} v/\varphi$. In complete analogy to (6.47) we have

$$
\widehat{Z}\_1^{(t)} = \int z \, f\_{(-\infty,\tau]}(z; \widehat{\theta}^{(t-1)}, v/\varphi) \, d\nu(z) = \frac{\alpha}{\beta} \frac{\mathcal{G}(\alpha+1, \beta\tau)}{\mathcal{G}(\alpha, \beta\tau)}.
$$

For the modeling we choose again the features used for model Gamma GLM2; this gives $q+1 = 7$ regression parameter components, and additionally we set the dispersion parameter to $\widehat{\varphi}^{\text{MLE}} = 1.427$. This dispersion parameter is kept fixed in all models studied in this example. In a first step we simply fit a gamma GLM to the lower-truncated data $Y\_i > \tau$. We call this model 'crude GLM2'; it overestimates the true claim sizes because it ignores the fact of having lower-truncated data.

To initialize the EM algorithm for lower-truncated data we use the model crude GLM2. We then iterate the algorithm for 10 steps, which provides convergence. The results are presented in Table 6.5. We observe that the resulting log-likelihood of the model fitted on the lower-truncated data and evaluated on the complete data $\boldsymbol{Y}$ (which is available here) is the same as for model Gamma GLM2, which has been estimated on the complete data. Moreover, this lower-truncated EM algorithm fitted model slightly under-estimates the average claim sizes.

Figure 6.18 shows the estimated means *μi* on an individual claims level. The *x*-axis always gives the estimates from the complete log-likelihood model Gamma GLM2. The *y*-axis on the left-hand side shows the estimates from the crude GLM and the right-hand side the estimates from the EM algorithm fitted counterpart (fitted on the lower-truncated data). We observe that the crude model overestimates

**Table 6.5** Comparison of the complete log-likelihood and the incomplete log-likelihood (lowertruncation *τ* = 1 000) results


**Fig. 6.18** Comparison of the estimated means $\mu\_i$ in model Gamma GLM2 against (lhs) the crude GLM and (rhs) the EM fitted lower-truncated model; both axes are on the log-scale

the claims (being above the orange diagonal); in particular, this applies to claims with lower expected claim amounts. The EM algorithm fitted model, which accounts for the lower-truncation, corrects for it, and the resulting estimates almost completely coincide with the ones from the complete log-likelihood model. Again we remark that we use an identical dispersion parameter estimate $\widehat{\varphi}^{\text{MLE}}$; it remains an open problem to select a reasonable value from lower-truncated data.

*Example 6.19 (Zero-Truncated Claim Counts and the Hurdle Poisson Model)* In Sect. 5.3.6, we have been studying the ZIP model that assigns an additional probability weight to the event {*N* = 0} of having zero claims. This model can be understood as a hierarchical model with a latent variable *Z* indicating whether we have an excess zero claim or not, see (5.41). In that situation we have a mixture of a Poisson distribution and a degenerate distribution. Fitting in Example 5.25 has been done by brute force using a general purpose optimizer, but we could also use the EM algorithm for mixture distributions.

An alternative way of modeling excess zeros is the hurdle approach which combines a lower-truncated count distribution with a point mass in zero. For the Poisson case this reads as, see (5.42),

$$f\_{\text{hurdle Poisson}}(k; \lambda, v, \pi\_0) = \begin{cases} \pi\_0 & \text{for } k = 0, \\ (1 - \pi\_0) \dfrac{e^{-v\lambda} \frac{(v\lambda)^k}{k!}}{1 - e^{-v\lambda}} & \text{for } k \in \mathbb{N}, \end{cases} \tag{6.53}$$
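As a quick sanity check (parameter values are illustrative assumptions), the hurdle Poisson probabilities (6.53) can be evaluated in Python and verified to sum to one:

```python
import math

def hurdle_poisson_pmf(k, lam, v, pi0):
    """Hurdle Poisson probabilities (6.53): point mass pi0 at zero plus a
    zero-truncated Poisson part with weight 1 - pi0."""
    if k == 0:
        return pi0
    return ((1 - pi0) * math.exp(-v * lam) * (v * lam) ** k
            / math.factorial(k) / (1 - math.exp(-v * lam)))

lam, v, pi0 = 0.08, 0.75, 0.95            # illustrative parameters
total = sum(hurdle_poisson_pmf(k, lam, v, pi0) for k in range(50))
print(round(total, 10))  # ~ 1.0
```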

for $\pi\_0 \in (0, 1)$ and $\lambda, v > 0$. If we ignore any observation {*N* = 0} we obtain a lower-truncated Poisson model, also called the zero-truncated Poisson (ZTP) model. This ZTP model can be fitted with the EM algorithm for lower-truncated data. In the following we only consider insurance policies $i$ with $N\_i > 0$. The log-likelihood of the ZTP model $N > 0$ is given by (we consider one single component only and drop the lower index in the notation)

$$\theta \mapsto \ell\_{N > 0}(\theta) = N\theta - v e^{\theta} - \log(N!) + N\log(v) - \log(1 - e^{-v e^{\theta}}), \qquad (6.54)$$

with exposure $v > 0$ and canonical parameter $\theta \in \boldsymbol{\Theta} = \mathbb{R}$ such that $\lambda = \exp\{\theta\}$. The ZTP model provides for the random variable $K$ the following geometric distribution (for the number of claims below the truncation point), see (6.50),

$$\mathbb{P}\_{\theta}[K=k] = \mathbb{P}\_{\theta}[N=0]^k \, \mathbb{P}\_{\theta}[N>0] = e^{-kv e^{\theta}} \left(1 - e^{-v e^{\theta}}\right).$$

In view of (6.51), this gives us complete log-likelihood (note that *Zj* = 0 for all *j* )

$$\begin{aligned} \ell\_{(N,K,\boldsymbol{Z})}(\theta) &= N\theta - ve^{\theta} - \log(N!) + N\log(v) + \sum\_{j=1}^{K} \left( Z\_j \theta - ve^{\theta} - \log(Z\_j!) + Z\_j \log(v) \right) \\ &= N\theta - (1+K)\,v e^{\theta} - \log(N!) + N\log(v). \end{aligned}$$

We can now directly apply a simplified version of the EM algorithm for lower-truncated data. For the E-step we have, given the parameter estimate $\widehat{\theta}^{(t-1)}$,

$$\widehat{K}^{(t)} = \frac{\mathbb{P}\_{\widehat{\theta}^{(t-1)}}[N=0]}{1 - \mathbb{P}\_{\widehat{\theta}^{(t-1)}}[N=0]} = \frac{e^{-v e^{\widehat{\theta}^{(t-1)}}}}{1 - e^{-v e^{\widehat{\theta}^{(t-1)}}}} \qquad \text{and} \qquad \widehat{Z}\_{1}^{(t)} = 0.$$

This provides us with the estimated weights and observations (set *Y* = *N/v*)

$$v^{(t)} = v\left(1 + \widehat{K}^{(t)}\right) = \frac{v}{1 - e^{-v e^{\widehat{\theta}^{(t-1)}}}} \qquad \text{and} \qquad \widehat{Y}^{(t)} = \frac{Y}{1 + \widehat{K}^{(t)}} = \frac{N}{v^{(t)}}.\tag{6.55}$$

Thus, the EM algorithm iterates Poisson MLEs, and the E-step modifies the weights $v^{(t)}$ in each step of the loop correspondingly. We remark that the ZTP model has an EF representation which allows one to directly estimate the corresponding parameters without using the EM algorithm, see Remark 6.20, below.
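A minimal numeric sketch of the iteration (6.55) for the ZTP model without covariates and with unit exposures (the simulation setup is an illustrative assumption, not the French MTPL data):

```python
import math
import random

def poisson_sample(lam):
    """Simple inversion sampler for a Poisson random variable."""
    k, p = 0, math.exp(-lam)
    c, u = p, random.random()
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k

random.seed(3)
lam_true, n_full = 0.8, 100_000
N = [poisson_sample(lam_true) for _ in range(n_full)]
N = [k for k in N if k > 0]               # zero-truncated observations
S, m = sum(N), len(N)

lam = S / m                               # crude initial fit (biased high)
for _ in range(200):
    v_t = 1 / (1 - math.exp(-lam))        # adjusted exposure, see (6.55)
    lam = S / (v_t * m)                   # Poisson MLE on adjusted data

print(round(lam, 2))  # close to lam_true = 0.8
```

The fixed point solves the ZTP moment equation $\lambda/(1 - e^{-\lambda}) = \bar{N}$, i.e., the EM loop reproduces the ZTP MLE.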

We revisit the French MTPL claim frequency data and, in particular, we use model Poisson GLM3 as a benchmark; we refer to Tables 5.5 and 5.10. The feature engineering is done exactly as in model Poisson GLM3. We then select only the insurance policies from the learning data $\mathcal{L}$ that have suffered at least one claim, i.e., $N\_i > 0$. These are *m* = 22 434 out of *n* = 610 206 insurance policies. Thus, we only consider *m/n* = 3.68% of all insurance policies, and we fit the lower-truncated log-likelihood (ZTP model) to this data

$$\ell\_{N>0}(\boldsymbol{\beta}) = \sum\_{i=1}^{m} N\_i \theta\_i - v\_i e^{\theta\_i} - \log(N\_i!) + N\_i \log(v\_i) - \log(1 - e^{-v\_i e^{\theta\_i}}),$$

**Fig. 6.19** (lhs) Convergence of the EM algorithm for the lower-truncated data in the Poisson hurdle case; (rhs) canonical parameters of the Poisson GLMs fitted on all data *L* vs. fitted only on policies with *Ni >* 0

**Table 6.6** Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses (units are in 10−2) and in-sample average frequency of the Poisson null model and the Poisson, negative-binomial, ZIP and hurdle Poisson GLMs


where $1 \le i \le m$ runs over all insurance policies with at least one claim and where the canonical parameter $\theta\_i$ is given by the linear predictor $\theta\_i = \langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle$. We fit this model using the EM algorithm for lower-truncated data. In each loop this requires that the offset $o\_i^{(t)} = \log(v\_i^{(t)})$ is adjusted according to (6.55); for the discussion of offsets we refer to Sect. 5.2.3. Convergence of the EM algorithm is achieved after roughly 75 iterations, see Fig. 6.19 (lhs).

In our first analysis we do not consider the Poisson hurdle model, but we simply consider model Poisson GLM3. However, this Poisson model with regression parameter *β* is fitted only on the data *Ni >* 0 (exactly using the results of the EM algorithm for lower-truncated data *Ni >* 0). The resulting predictive model is presented in Table 6.7. We observe that model Poisson GLM3 that is only fitted on the data *Ni >* 0 is clearly not competitive, i.e., we cannot simply extrapolate this estimated model to {*Ni* = 0}. This extrapolation results in a Poisson GLM that has a much too large average frequency of 15.11%, see last column of Table 6.7; this bias can clearly be seen in Fig. 6.19 (rhs) where we compare the two fits. From this we conclude that either the Poisson model assumption in general does not

**Table 6.7** Number of parameters, in-sample and out-of-sample deviance losses on all data (units are in 10−2), out-of-sample lower-truncated log-likelihood *N>*<sup>0</sup> and in-sample average frequency of the Poisson null model and model Poisson GLM3 fitted on all data *L* and fitted on the data *Ni >* 0 only


match the data, or that we have excess zeros (which do not influence the estimation procedure if we only consider the policies with at least one claim). Let us compare the lower-truncated log-likelihood $\ell\_{N>0}$ out-of-sample only on the policies with at least one claim (ZTP model). We observe that the EM fitted model provides a better description of the data, as we have a bigger log-likelihood than the model fitted on all data $\mathcal{L}$ (i.e., −0.2195 vs. −0.2278 for the ZTP log-likelihood). Thus, the lower-truncated fitting procedure finds a better model on $\{N\_i > 0\}$ when only fitted on these lower-truncated claim counts.

This analysis concludes that we need to fit the full hurdle Poisson model (6.53). That is, we cannot simply extrapolate the model fitted on the ZTP log-likelihood $\ell\_{N>0}$ because, typically, $\pi\_0(\boldsymbol{x}\_i) \neq \exp\{-v\_i e^{\langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle}\}$, the latter coming from the Poisson GLM with regression parameter $\boldsymbol{\beta}$. We model the zero claim probability $\pi\_0(\boldsymbol{x}\_i)$ by a logistic Bernoulli GLM indicating whether we have claims or not. We set up the logistic GLM for $p(\boldsymbol{x}\_i) = 1 - \pi\_0(\boldsymbol{x}\_i)$, describing the indicator $Y\_i = \mathbb{1}\_{\{N\_i > 0\}}$ of having claims. The difficulty compared to the Poisson model is that we cannot easily integrate the time exposure $v\_i$ as a pro rata temporis variable as in the Poisson case. We therefore make the following considerations. The canonical link in the logistic Bernoulli GLM is the logit function $p \mapsto \text{logit}(p) = \log(p/(1-p)) = \log(p) - \log(1-p)$ for $p \in (0,1)$. Typically, in our application, $p \ll 1$ is fairly small because claims are rare events. This implies $\log(p/(1-p)) \approx \log(p)$, i.e., the logit link behaves similarly to the log-link for small probabilities $p$. This motivates integrating the logged exposures $\log v\_i$ as offsets into the logistic probabilities. That is, we make the following model assumption

$$p(\boldsymbol{x}\_i, v\_i) \mapsto \text{logit}(p(\boldsymbol{x}\_i, v\_i)) = \log(v\_i) + \langle \boldsymbol{\beta}^{\ast}, \boldsymbol{x}\_i \rangle,$$

with offset $o\_i = \log(v\_i)$ and regression parameter $\boldsymbol{\beta}^{\ast} \in \mathbb{R}^{q+1}$. We fit this model using the R command glm with family=binomial(). The results then allow us to define the estimated hurdle Poisson model by, recall $p(\boldsymbol{x}\_i, v\_i) = 1 - \pi\_0(\boldsymbol{x}\_i, v\_i)$,

$$\widehat{f}\_{\text{hurdle Poisson}}(k; \boldsymbol{x}\_i, v\_i) = \begin{cases} 1 - \widehat{p}(\boldsymbol{x}\_i, v\_i) = \left( 1 + \exp\{\log(v\_i) + \langle \widehat{\boldsymbol{\beta}}^{\ast}, \boldsymbol{x}\_i \rangle\} \right)^{-1} & \text{for } k = 0, \\ \dfrac{\widehat{p}(\boldsymbol{x}\_i, v\_i)}{1 - e^{-\widehat{\mu}(\boldsymbol{x}\_i, v\_i)}}\, e^{-\widehat{\mu}(\boldsymbol{x}\_i, v\_i)}\, \dfrac{\widehat{\mu}(\boldsymbol{x}\_i, v\_i)^k}{k!} & \text{for } k \in \mathbb{N}, \end{cases}$$


**Table 6.8** Contingency table of the observed numbers of policies against predicted numbers of policies with given claim counts ClaimNb (in-sample)

where $\widehat{\boldsymbol{\beta}}^{\ast} \in \mathbb{R}^{q+1}$ is the estimated regression parameter from the logistic Bernoulli GLM, and where $\widehat{\mu}(\boldsymbol{x}\_i, v\_i) = v\_i \exp\langle \widehat{\boldsymbol{\beta}}, \boldsymbol{x}\_i \rangle$ is the Poisson GLM mean estimated with the EM algorithm on the lower-truncated data $N\_i > 0$ (ZTP model). The results are presented in Table 6.6.
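The offset motivation used above, $\text{logit}(p) \approx \log(p)$ for small $p$, is easy to check numerically (the probability values below are arbitrary illustrations):

```python
import math

# logit(p) = log(p) - log(1 - p) approaches log(p) as p -> 0,
# since -log(1 - p) ~ p vanishes; this is what justifies entering
# log(v) as an offset into the logistic GLM.
logit = lambda p: math.log(p / (1 - p))

for p in (0.10, 0.03, 0.01):
    print(p, round(logit(p), 4), round(math.log(p), 4))
```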

Table 6.6 compares the hurdle Poisson model to the approaches studied in Table 5.10. Firstly, fitting the hurdle Poisson model is more time intensive: the EM algorithm takes some time, and we need to fit the Bernoulli logistic GLM, which is of a similar complexity as fitting model Poisson GLM3. The results in terms of AIC look convincing. The hurdle Poisson model provides an excellent model for the indicator of having a claim (here it outperforms model ZIP GLM3). It also tries to optimally fit a ZTP model to all insurance policies having at least one claim. This can also be seen from Table 6.8, which reports the expected number of policies suffering the different numbers of claims.

We close this example by concluding that the hurdle Poisson model provides the best description, at the price of using more parameters. The ZIP model could be lifted to a similar level; however, we consider fitting the hurdle approach to be more convenient, see also Remark 6.20, below. In particular, feature engineering seems simpler in the hurdle approach because the different effects are clearly separated, whereas in the ZIP approach it is more difficult to suitably model the excess zeros, see also Listing 5.10.

*Remark 6.20* In (6.54) we have been considering the ZTP model for different exposures *v >* 0. If we set these exposures to *v* = 1, we obtain the ZTP log-likelihood

$$\ell\_{N > 0}(\theta) = N\theta - \left(e^{\theta} + \log(1 - e^{-e^{\theta}})\right) - \log(N!).$$

Note that this describes a single-parameter linear EF with cumulant function

$$\kappa(\theta) = e^{\theta} + \log(1 - e^{-e^{\theta}}),$$

for canonical parameter in the effective domain $\theta \in \boldsymbol{\Theta} = \mathbb{R}$. The mean of this EF model is given by

$$
\mu = \mathbb{E}\_{\theta}[N] = \kappa'(\theta) = \frac{e^{\theta}}{1 - e^{-e^{\theta}}} = \frac{\lambda}{1 - e^{-\lambda}},
$$

where we set $\lambda = e^{\theta}$. The variance is given by

$$\text{Var}\_{\theta}(N) = \kappa''(\theta) = \mu \left( \frac{e^{\lambda} - (1 + \lambda)}{e^{\lambda} - 1} \right) = \mu \left( 1 - \mu e^{-\lambda} \right) > 0.$$

Note that the term in brackets is positive but less than one, which implies that the ZTP model has under-dispersion. Alternatively to the EM algorithm, we can also directly fit a GLM to this ZTP model. The only difficulty is that we need to appropriately integrate the time exposures. The original Poisson model suggests that if we choose the canonical parameter to be equal to the linear predictor, we should integrate the logged exposures as offsets into the linear predictors. Along these lines, if we choose the canonical link $h = (\kappa')^{-1}$ of the ZTP model, we receive that the canonical parameter $\theta$ is equal to the linear predictor $\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle$, and we can directly integrate the logged exposures as offsets into the canonical parameters, see (5.25). This then allows us to directly fit this ZTP model with exposures using Fisher's scoring method. In this case of a concave log-likelihood function, the result will be identical to the solution of the EM algorithm found in Example 6.19, and, in fact, this direct approach is more straightforward and more time-efficient. Similar considerations can be done for other hurdle models.

## *6.4.4 Composite Models*

In Sect. 6.3.1 we have promoted mixing distributions in cases where the data cannot be modeled by a single EDF distribution. Alternatively, one can also compose densities, which leads to so-called *composite models* (also called splicing models). This idea has been introduced to the actuarial literature by Cooray–Ananda [81] and Scollnik [332]. Assume we have two absolutely continuous densities $f^{(i)}(\cdot; \theta_i)$ with corresponding distribution functions $F^{(i)}(\cdot; \theta_i)$, $i = 1, 2$. These two densities can easily be composed at a splicing value $\tau$ and with weight $p \in (0, 1)$ by considering the following composite density

$$f(y; p, \theta_1, \theta_2) = p\, \frac{f^{(1)}(y; \theta_1)\, \mathbb{1}_{\{y \le \tau\}}}{F^{(1)}(\tau; \theta_1)} + (1 - p)\, \frac{f^{(2)}(y; \theta_2)\, \mathbb{1}_{\{y > \tau\}}}{1 - F^{(2)}(\tau; \theta_2)},\qquad(6.56)$$

provided that both denominators are non-zero. In this notation we treat the splicing value $\tau$ as a hyper-parameter that is chosen by the modeler and is not estimated from the data. In view of (6.41) we can rewrite this in terms of lower- and upper-truncated densities

$$f(y; p, \theta_1, \theta_2) = p\, f^{(1)}_{(-\infty, \tau]}(y; \theta_1) + (1 - p)\, f^{(2)}_{(\tau, \infty)}(y; \theta_2).$$

In this notation, we see that a composite model can also be interpreted as a mixture model with mixture probability $p \in (0, 1)$ and mixing densities $f^{(1)}_{(-\infty,\tau]}$ and $f^{(2)}_{(\tau,\infty)}$ having disjoint supports $(-\infty, \tau]$ and $(\tau, \infty)$, respectively.

These disjoint supports allow for simpler MLE, i.e., we do not need to rely on the 'EM algorithm for mixture distributions' to fit this model. The log-likelihood of *Y* ∼ *f (y*; *p, θ*1*, θ*2*)* is given by

$$\begin{aligned} \ell_Y(p, \theta_1, \theta_2) &= \left(\log(p) + \log f^{(1)}_{(-\infty, \tau]}(Y; \theta_1) \right) \mathbb{1}_{\{Y \le \tau\}} \\ &\quad + \left(\log(1 - p) + \log f^{(2)}_{(\tau, \infty)}(Y; \theta_2) \right) \mathbb{1}_{\{Y > \tau\}}. \end{aligned}$$

This shows that the log-likelihood nicely decouples in the composite case, and all parameters can directly be estimated with MLE: parameter $\theta_1$ uses all observations smaller than or equal to $\tau$, parameter $\theta_2$ uses all observations larger than $\tau$, and $p$ is estimated by the proportions of claims below and above the splicing point $\tau$. This holds for a null model as well as for a GLM approach for $\theta_1$, $\theta_2$ and $p$.
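As an illustration of this decoupled estimation, consider a hypothetical null model that splices an upper-truncated log-normal body with a Pareto tail (this specific distributional pair and all names below are our own choices for the sketch; the Pareto tail even admits a closed-form MLE for its tail index):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_composite(y, tau):
    """Decoupled MLE: log-normal body on (0, tau], Pareto tail on (tau, inf)."""
    below, above = y[y <= tau], y[y > tau]
    p_hat = below.size / y.size                        # proportion below the splicing point

    # upper-truncated log-normal: divide the density by F^{(1)}(tau; theta_1)
    def neg_loglik(params):
        mu, log_sig = params
        sig = np.exp(log_sig)                          # enforce sigma > 0
        z = (np.log(below) - mu) / sig
        log_dens = norm.logpdf(z) - np.log(sig) - np.log(below)
        log_trunc = norm.logcdf((np.log(tau) - mu) / sig)
        return -(np.sum(log_dens) - below.size * log_trunc)

    start = np.array([np.log(below).mean(), np.log(np.log(below).std())])
    res = minimize(neg_loglik, start, method="Nelder-Mead")
    mu_hat, sig_hat = res.x[0], np.exp(res.x[1])

    # lower-truncated Pareto above tau: closed-form MLE for the tail index
    alpha_hat = above.size / np.sum(np.log(above / tau))
    return p_hat, mu_hat, sig_hat, alpha_hat

# simulated claims: with probability p from the truncated body, else from the tail
rng = np.random.default_rng(0)
n, tau, p, alpha = 20000, 5.0, 0.7, 2.5
is_body = rng.random(n) < p
body = np.exp(rng.normal(1.0, 0.5, size=4 * n))
body = body[body <= tau][:is_body.sum()]               # rejection sampling below tau
tail = tau * (1.0 - rng.random((~is_body).sum())) ** (-1.0 / alpha)
y = np.concatenate([body, tail])

p_hat, mu_hat, sig_hat, alpha_hat = fit_composite(y, tau)
```

Each of the three pieces is estimated from its own part of the data, exactly as the decoupled log-likelihood suggests.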

Nevertheless, the EM algorithm may still be used for parameter estimation, namely, truncation may ask for the 'EM algorithm for truncated data'. Alternatively, we could also use the 'EM algorithm for censored data' to estimate the truncated densities, because we have knowledge of the number of claims above and below the splicing point $\tau$; thus, we could right- or left-censor these claims. The latter may lead to more stability in the estimation procedure since we use more information in parameter estimation, i.e., the two truncated density estimates will not be independent because they simultaneously consider all claim counts (but not the individual claim sizes, due to censoring).

For composite models one sometimes requires more regularity of the densities; we may, e.g., require continuity of the density at the splicing point, which gives the mixture probability

$$p = \frac{f^{(2)}(\tau; \theta_2)\, F^{(1)}(\tau; \theta_1)}{f^{(1)}(\tau; \theta_1)\left(1 - F^{(2)}(\tau; \theta_2)\right) + f^{(2)}(\tau; \theta_2)\, F^{(1)}(\tau; \theta_1)}.$$

This reduces the number of parameters to be estimated but complicates the score equations. If we additionally require a differentiability condition at $\tau$, we obtain the requirement

$$p = \frac{f_y^{(2)}(\tau; \theta_2)\, F^{(1)}(\tau; \theta_1)}{f_y^{(1)}(\tau; \theta_1)\left(1 - F^{(2)}(\tau; \theta_2)\right) + f_y^{(2)}(\tau; \theta_2)\, F^{(1)}(\tau; \theta_1)},$$

where $f_y^{(i)}(y; \theta_i)$ denotes the first derivative of $f^{(i)}(y; \theta_i)$ w.r.t. $y$. Together with the continuity condition, this provides the following requirement for differentiability at $\tau$

$$\frac{f^{(2)}(\tau;\theta_2)}{f^{(1)}(\tau;\theta_1)} = \frac{f_y^{(2)}(\tau;\theta_2)}{f_y^{(1)}(\tau;\theta_1)}.$$

Again this reduces the degrees of freedom in parameter estimation but complicates the score equations. We refrain from giving an example and close this section; we will consider a deep composite regression model in Sect. 11.3.2, below, where we replace the fixed splicing point by a quantile for a fixed quantile level.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 7 Deep Learning**

In the sequel, we introduce deep learning models. In this chapter, these deep learning models are based on fully-connected feed-forward neural networks. We present these networks as an extension of GLMs. These networks perform feature engineering themselves. We discuss how networks achieve this, and we explain how networks are used for predictive modeling. There is a vast and growing literature on deep learning with networks; the classical reference is the book of Goodfellow et al. [166], but also the numerous tutorials around the open-source deep learning libraries TensorFlow [2], Keras [77] and PyTorch [296] give an excellent overview of the state-of-the-art in this field.

## **7.1 Deep Learning and Representation Learning**

In Chap. 5 on GLMs, we have been modeling the mean structure of the responses *Y* , given features *x*, by the following regression function, see (5.6),

$$\boldsymbol{x} \mapsto \mu(\boldsymbol{x}) = \mathbb{E}_{\theta(\boldsymbol{x})}\left[ Y \right] = g^{-1} \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle. \tag{7.1}$$

The crucial assumption has been that the regression function (7.1) provides a reasonable functional description of the expected value $\mathbb{E}_{\theta(\boldsymbol{x})}[Y]$ of datum $(Y, \boldsymbol{x})$. As described in Sect. 5.2.2, this typically requires *manual feature engineering* of $\boldsymbol{x}$, bringing the feature information into the right structural form.

In contrast to manual feature engineering, deep learning aims at performing an *automated feature engineering* within the statistical model by massaging information through different transformations. Deep learning uses a finite sequence of functions $(\boldsymbol{z}^{(m)})_{1 \le m \le d}$, called *layers*,

$$
\boldsymbol{z}^{(m)}: \{1\} \times \mathbb{R}^{q_{m-1}} \to \{1\} \times \mathbb{R}^{q_m},
$$

of (fixed) dimensions $q_m \in \mathbb{N}$, $1 \le m \le d$, and initialization $q_0 = q$ being the dimension of the (raw) feature information $\boldsymbol{x} \in \mathcal{X} \subset \{1\} \times \mathbb{R}^q$. Each of these layers presents a new *representation of the features*, that is, after layer $m$ we have a $q_m$-dimensional representation of the raw feature $\boldsymbol{x} \in \mathcal{X}$

$$\mathbf{z}^{(m:1)}(\mathbf{x}) \stackrel{\text{def.}}{=} \left(\mathbf{z}^{(m)} \circ \cdots \circ \mathbf{z}^{(1)}\right)(\mathbf{x}) \in \{1\} \times \mathbb{R}^{q\_m}.\tag{7.2}$$

Note that the first component is always identically equal to 1. For this reason we say that the representation $\boldsymbol{z}^{(m:1)}(\boldsymbol{x}) \in \{1\} \times \mathbb{R}^{q_m}$ of $\boldsymbol{x}$ is $q_m$-dimensional.

*Deep learning* now assumes that we have $d \in \mathbb{N}$ appropriate transformations (layers) $\boldsymbol{z}^{(m)}$, $1 \le m \le d$, such that $\boldsymbol{z}^{(d:1)}(\boldsymbol{x})$ provides a suitable $q_d$-dimensional representation of the raw feature $\boldsymbol{x} \in \mathcal{X}$, which then enters a GLM

$$
\mu(\boldsymbol{x}) = \mathbb{E}_{\theta(\boldsymbol{x})}\left[ Y \right] = g^{-1} \langle \boldsymbol{\beta}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x}) \rangle,\tag{7.3}
$$

with link function $g: \mathcal{M} \to \mathbb{R}$ and regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q_d+1}$. This regression architecture is called a *feed-forward network* of *depth* $d \in \mathbb{N}$ because the information $\boldsymbol{x}$ is processed in a directed acyclic (feed-forward) path through the $d$ layers $\boldsymbol{z}^{(1)}, \ldots, \boldsymbol{z}^{(d)}$ before entering the final GLM.

Each layer *z(m)* involves parameters. Successful deep learning simultaneously fits these parameters as well as the regression parameter *β* to the available learning data *L* so that we obtain an optimal predictive model on the test data *T* . That is, the learned model should optimally generalize to unseen data, we refer to Chap. 4 on predictive modeling. Thus, the process of optimal representation learning is also part of the model fitting procedure. In contrast to GLMs, the resulting log-likelihood functions are non-concave in their parameters because, typically, each layer involves non-linear transformations. This makes model fitting a challenge. State-of-the-art model fitting in deep learning uses variants of the gradient descent algorithm which we have already met in Sect. 6.2.4.

*Remark 7.1* Representation learning $\boldsymbol{x} \mapsto \boldsymbol{z}^{(d:1)}(\boldsymbol{x})$ is closely related to Mercer's kernel [272]. If we have a portfolio with features $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$, we obtain a Mercer's kernel by considering the matrix

$$\boldsymbol{K} = \left( K(\boldsymbol{x}_i, \boldsymbol{x}_j) \right)_{1 \le i, j \le n} = \left( \left\langle \boldsymbol{z}^{(d:1)}(\boldsymbol{x}_i),\, \boldsymbol{z}^{(d:1)}(\boldsymbol{x}_j) \right\rangle \right)_{1 \le i, j \le n} \in \mathbb{R}^{n \times n}. \tag{7.4}$$

In many regression problems it can be shown that one can equivalently work with the design matrix $\mathfrak{Z} = (\boldsymbol{z}^{(d:1)}(\boldsymbol{x}_1), \ldots, \boldsymbol{z}^{(d:1)}(\boldsymbol{x}_n))^\top \in \mathbb{R}^{n \times (q_d+1)}$ or with Mercer's kernel $\boldsymbol{K} \in \mathbb{R}^{n \times n}$. Mercer's kernel does not require the full knowledge of the learned representations $\boldsymbol{z}^{(d:1)}(\boldsymbol{x}_i)$, but it suffices to know the discrepancies between $\boldsymbol{z}^{(d:1)}(\boldsymbol{x}_i)$ and $\boldsymbol{z}^{(d:1)}(\boldsymbol{x}_j)$ measured by the scalar products $K(\boldsymbol{x}_i, \boldsymbol{x}_j)$. This is also closely related to the cosine similarity in word embeddings, see (10.11). This approach then results in replacing the search for an optimal representation learning by a search for the optimal Mercer's kernel for the given data; this is called the kernel trick in machine learning.
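The relation between the design matrix of learned representations and Mercer's kernel can be sketched numerically as follows (a made-up one-layer representation with random weights; all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q0, q1 = 5, 3, 4

# raw features with intercept component, x in {1} x R^{q0}
X = np.column_stack([np.ones(n), rng.normal(size=(n, q0))])

# a fixed, made-up one-layer tanh representation z^{(1:1)}(x) in {1} x R^{q1}
W = rng.normal(size=(q0 + 1, q1))
Z = np.column_stack([np.ones(n), np.tanh(X @ W)])      # design matrix, shape (n, q1+1)

# Mercer's kernel (7.4): pairwise scalar products of the learned representations
K = Z @ Z.T                                            # shape (n, n)
```

By construction K is symmetric and positive semi-definite, and kernel methods can work with K alone, without access to the representations themselves.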

## **7.2 Generic Feed-Forward Neural Networks**

*Feed-forward neural* (FN) *networks* use special layers *z(m)* in (7.2)–(7.3), whose components are called *neurons*. This is discussed and studied in detail in this section.

## *7.2.1 Construction of Feed-Forward Neural Networks*

FN networks are regression functions of type (7.3) where each neuron $z_j^{(m)}$, $1 \le j \le q_m$, of the layers $\boldsymbol{z}^{(m)} = (1, z_1^{(m)}, \ldots, z_{q_m}^{(m)})^\top$, $1 \le m \le d$, has the structure of a GLM; the first component $z_0^{(m)} = 1$ always plays the role of the intercept and does not need any modeling.

A first important choice is the *activation function <sup>φ</sup>* : <sup>R</sup> <sup>→</sup> <sup>R</sup> which plays the role of the inverse link function *g*−1. To perform non-linear representation learning, this activation function should be non-linear, too. The most popular choices of activation functions are listed in Table 7.1.

The first three examples in Table 7.1 are smooth functions with simple derivatives, see the last column of Table 7.1. Having simple derivatives is an advantage in gradient descent algorithms for model fitting. The derivative of the ReLU activation function for $x \neq 0$ is given by the step function activation, and in 0 one typically considers a sub-gradient. We briefly comment on these activation functions.

**Table 7.1** Popular choices of non-linear activation functions and their derivatives; the last two examples are not strictly monotone


• We are mainly going to use the *hyperbolic tangent activation* function

$$x \mapsto \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\left(1 + e^{-2x}\right)^{-1} - 1 \in (-1, 1).$$

Figure 7.1 illustrates the hyperbolic tangent activation function.

The hyperbolic tangent activation function is anti-symmetric w.r.t. the origin with range *(*−1*,* 1*)*. This anti-symmetry and boundedness is an advantage in fitting deep FN network architectures. For this reason we usually prefer the hyperbolic tangent over other activation functions.


A *FN layer* with activation function *φ* is a mapping

$$\mathbf{z}^{(m)}: \{1\} \times \mathbb{R}^{q\_{m-1}} \to \{1\} \times \mathbb{R}^{q\_m} \tag{7.5}$$

$$\mathbf{z} \mapsto \mathbf{z}^{(m)}(\mathbf{z}) = \left(1, z\_1^{(m)}(\mathbf{z}), \dots, z\_{q\_m}^{(m)}(\mathbf{z})\right)^{\top},$$

having neurons for 1 ≤ *j* ≤ *qm*

$$z\_j^{(m)}(\mathbf{z}) = \phi \langle \mathbf{w}\_j^{(m)}, \mathbf{z} \rangle = \phi \left( \sum\_{l=0}^{q\_{m-1}} w\_{l,j}^{(m)} z\_l \right), \tag{7.6}$$

with given *network weights* $\boldsymbol{w}_j^{(m)} = (w_{l,j}^{(m)})_{0 \le l \le q_{m-1}} \in \mathbb{R}^{q_{m-1}+1}$.

**Interpretation** Every neuron $\boldsymbol{z} \mapsto z_j^{(m)}(\boldsymbol{z})$ describes a GLM regression function with link function $\phi^{-1}$ and regression parameter $\boldsymbol{w}_j^{(m)} \in \mathbb{R}^{q_{m-1}+1}$ for features $\boldsymbol{z} \in \{1\} \times \mathbb{R}^{q_{m-1}}$. These GLM regression functions can be interpreted as data compression, i.e., in each neuron the $q_{m-1}$-dimensional feature $\boldsymbol{z}$ is projected to a real number $\langle \boldsymbol{w}_j^{(m)}, \boldsymbol{z} \rangle \in \mathbb{R}$ which is then (non-linearly) activated by $\phi$. Since this leads to a substantial loss of information, we perform this procedure of data compression $q_m$ times in FN layer $\boldsymbol{z}^{(m)}$, so that each neuron in $(z_j^{(m)}(\boldsymbol{z}))_{1 \le j \le q_m}$ represents a different projection of the input $\boldsymbol{z}$. Choosing suitable weights $\boldsymbol{w}_j^{(m)}$ will allow us to extract the crucial feature information from $\boldsymbol{z}$ to receive good explanatory variables for the regression task at hand.

A FN network of depth *<sup>d</sup>* <sup>∈</sup> <sup>N</sup> is obtained by composing *<sup>d</sup>* FN layers *z(*1*) ,..., z(d)* to receive the mapping

$$\boldsymbol{z}^{(d:1)}: \{1\} \times \mathbb{R}^{q_0=q} \to \{1\} \times \mathbb{R}^{q_d} \tag{7.7}$$

$$\boldsymbol{x} \mapsto \boldsymbol{z}^{(d:1)}(\boldsymbol{x}) = \left(\boldsymbol{z}^{(d)} \circ \cdots \circ \boldsymbol{z}^{(1)}\right)(\boldsymbol{x}).$$

Choosing a strictly monotone and smooth link function *g* and a regression parameter *<sup>β</sup>* <sup>∈</sup> <sup>R</sup>*qd*+<sup>1</sup> we receive the FN network regression function

$$
\boldsymbol{x} \in \mathcal{X} \mapsto \mu(\boldsymbol{x}) = g^{-1} \langle \boldsymbol{\beta}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x}) \rangle. \tag{7.8}
$$

**Fig. 7.2** FN network of depth *d* = 3, with number of neurons *(q*1*, q*2*, q*3*)* = *(*20*,* 15*,* 10*)* and input dimension *<sup>q</sup>*<sup>0</sup> <sup>=</sup> 40. This gives us a network parameter *<sup>ϑ</sup>* <sup>∈</sup> <sup>R</sup>*<sup>r</sup>* of dimension *<sup>r</sup>* <sup>=</sup> <sup>1</sup> 306

This FN network regression function (7.8) has a *network parameter* $\vartheta = (\boldsymbol{w}_1^{(1)}, \ldots, \boldsymbol{w}_{q_d}^{(d)}, \boldsymbol{\beta})^\top \in \mathbb{R}^r$ of dimension

$$r = \sum\_{m=1}^{d} q\_m(q\_{m-1} + 1) + (q\_d + 1).$$

In Fig. 7.2 we illustrate a FN network of depth $d = 3$, FN layers of dimensions $(q_1, q_2, q_3) = (20, 15, 10)$ and input dimension $q_0 = 40$.<sup>1</sup> This gives us a network parameter $\vartheta \in \mathbb{R}^r$ of dimension $r = 1306$. On the left-hand side we have the raw features $\boldsymbol{x} \in \mathcal{X} \subset \{1\} \times \mathbb{R}^{q_0}$; these are processed through the three FN layers, where the black circles illustrate the neurons $z_j^{(m)}$. The third FN layer $\boldsymbol{z}^{(3)}$ has dimension

<sup>1</sup> Figures 7.2 and 7.9 are similar to Figure 1 in [122], and all FN network plots have been created with modified versions of the plot functions of the R package neuralnet [144].

$q_3 = 10$, providing the learned representation $\boldsymbol{z}^{(3:1)}(\boldsymbol{x}) \in \{1\} \times \mathbb{R}^{q_3}$ of $\boldsymbol{x}$. This is used in the final GLM step (7.8) in the green box of Fig. 7.2.
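The forward pass (7.7)-(7.8) and the parameter-count formula above are easily sketched in code; the following minimal illustration (random weights, our own function names, and a log-link $g^{-1} = \exp$ chosen as an assumption) reproduces the dimension $r = 1306$ of the architecture in Fig. 7.2:

```python
import numpy as np

def fn_network(x, weights, beta, phi=np.tanh, g_inv=np.exp):
    """Forward pass of a FN network (7.7)-(7.8); here with log-link, g^{-1} = exp."""
    z = np.concatenate([[1.0], x])                 # x lives in {1} x R^{q0}
    for W in weights:                              # W has shape (q_{m-1}+1, q_m)
        z = np.concatenate([[1.0], phi(z @ W)])    # FN layer, prepend intercept unit
    return g_inv(beta @ z)                         # final GLM step (7.8)

# architecture of Fig. 7.2: q0 = 40 and (q1, q2, q3) = (20, 15, 10)
rng = np.random.default_rng(0)
dims = [40, 20, 15, 10]
weights = [rng.normal(scale=0.1, size=(q_in + 1, q_out))
           for q_in, q_out in zip(dims[:-1], dims[1:])]
beta = rng.normal(scale=0.1, size=dims[-1] + 1)

# network parameter dimension r = sum_m q_m (q_{m-1}+1) + (q_d+1)
r = sum(W.size for W in weights) + beta.size
mu = fn_network(rng.normal(size=40), weights, beta)
```

For this architecture, r = 20·41 + 15·21 + 10·16 + 11 = 1306, matching Fig. 7.2.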

*Remarks 7.2*


## *7.2.2 Universality Theorems*

The use of FN networks for representation learning is motivated by the so-called *universality theorems* which say that any compactly supported continuous (regression) function can be approximated arbitrarily well by a suitably large FN network. As such, we can understand the FN network framework as an approximation tool which, of course, is useful far beyond statistical modeling. In Chapter 12 we give some proofs of selected universality statements to illustrate the flavor of such results. In particular, Cybenko [86], Hornik et al. [192], Hornik [191], Leshno et al. [247], Park–Sandberg [293, 294], Petrushev [302] and Isenbeck–Rüschendorf [198] have shown (under mild conditions on the activation function) that shallow FN networks can approximate any compactly supported continuous function arbitrarily well (in supremum norm or in $L^2$-norm), if we allow for an arbitrary number of neurons $q_1 \in \mathbb{N}$ in the single FN layer. Roughly speaking, such a result for shallow FN networks holds true if and only if the chosen activation function is non-polynomial, see Leshno et al. [247]. Such results are proved either by algebraic methods of Stone–Weierstrass type or by Wiener–Tauberian denseness type arguments. Moreover, approximation results are studied in Barron [25, 26], Yukich et al. [399], Makovoz [262], Pinkus [303] and Döhler–Rüschendorf [108].

The above stated universality theorems say that shallow FN networks are sufficient from an approximation point of view. Nevertheless, we will mainly use deep (multiple layers) FN networks, below. These have better convergence properties to given function classes because they more easily promote interactions in feature components compared to shallow ones. Such questions have been studied, e.g., by Elbrächter et al. [120], Kidger–Lyons [215], Lu et al. [260] or Cheridito et al. [75]. For instance, Elbrächter et al. [120] compare finite-depth wide networks to finite-width deep networks (under the choice of the ReLU activation function), and they conclude that for many function classes deep networks lead to exponential approximation rates, whereas shallow networks only provide polynomial approximation rates for the same number of network parameters. This motivates considering sufficiently deep FN networks for representation learning because these typically have a better approximation capacity compared to shallow ones.

We motivate this by two simple examples, using the step function activation $\phi(x) = \mathbb{1}_{\{x \ge 0\}} \in \{0, 1\}$. With the step function activation, each neuron partitions $\mathbb{R}^{q_{m-1}}$ along a hyperplane, i.e.,

$$\mathbf{z} \mapsto \boldsymbol{z}\_j^{(m)}(\mathbf{z}) = \boldsymbol{\phi}\langle\mathbf{w}\_j^{(m)}, \mathbf{z}\rangle = \mathbb{1}\_{\left\{\sum\_{l=1}^{q\_{m-1}} w\_{l,j}^{(m)} \boldsymbol{z}\_l \geq -w\_{0,j}^{(m)}\right\}} \in \{0, 1\}.\tag{7.9}$$

For a shallow FN network we can study the question of the maximal complexity of the resulting partition of the feature space $\mathcal{X} \subset \{1\} \times \mathbb{R}^{q_0}$ when considering $q_1$ neurons (7.9) in the single FN layer $\boldsymbol{z}^{(1)}$. Zaslavsky [400] proved that $q_1$ hyperplanes can partition the Euclidean space $\mathbb{R}^{q_0}$ into at most

$$\sum\_{j=0}^{\min\{q\_0, q\_1\}} \binom{q\_1}{j} \qquad \text{disjoint sets.} \tag{7.10}$$

This number (7.10) can be seen as a maximal upper complexity bound for shallow FN networks with step function activation. It grows exponentially for $q_1 \le q_0$, and it slows down to a polynomial growth for $q_1 > q_0$. Thus, the complexity of shallow FN networks grows comparably slowly once the width $q_1$ of the network exceeds $q_0$, and therefore we often need a huge network to receive a good approximation.

This result (7.10) should be contrasted with Theorem 4 in Montúfar et al. [280] who give a lower bound on the complexity of regression functions of deep FN networks (under the ReLU activation function). Assume $q_m \ge q_0$ for all $1 \le m \le d$. The maximal complexity is bounded below by

$$
\left(\prod\_{m=1}^{d-1} \left\lfloor \frac{q\_m}{q\_0} \right\rfloor^{q\_0} \right) \sum\_{j=0}^{q\_0} \binom{q\_d}{j} \qquad \text{disjoint linear regions.}\tag{7.11}
$$

If we choose as an example a FN network with fixed width *qm* = 4 for all *m* ≥ 1 and an input of dimension *q*<sup>0</sup> = 2, we receive from (7.11) a lower bound of

$$4^{d-1} \left( \binom{4}{0} + \binom{4}{1} + \binom{4}{2} \right) = \frac{11}{4} \exp\{d \log(4)\}.$$

Thus, we have an exponential growth in depth *d* → ∞. This contrasts the polynomial complexity growth (7.10) of shallow FN networks.
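Both counting bounds are straightforward to evaluate; the following sketch (function names are ours) computes the upper bound (7.10) of Zaslavsky and the lower bound (7.11) of Montúfar et al.:

```python
from math import comb, floor

def zaslavsky_regions(q0, q1):
    """Upper bound (7.10): max number of regions q1 hyperplanes cut R^{q0} into."""
    return sum(comb(q1, j) for j in range(min(q0, q1) + 1))

def montufar_lower_bound(q0, widths):
    """Lower bound (7.11) on linear regions of a ReLU net; widths = (q_1, ..., q_d)."""
    prod = 1
    for q_m in widths[:-1]:
        prod *= floor(q_m / q0) ** q0
    return prod * sum(comb(widths[-1], j) for j in range(q0 + 1))

# width-4 ReLU network on q0 = 2 inputs: the lower bound grows like (11/4) * 4^d
print([montufar_lower_bound(2, [4] * d) for d in (1, 2, 3)])   # → [11, 44, 176]
print(zaslavsky_regions(2, 12))                                # → 79
```

A shallow network with 12 neurons only reaches 79 regions, whereas the depth-3 width-4 network already guarantees 176 linear regions with fewer neurons in total.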

*Example 7.3 (Shallow vs. Deep Networks: Partitions)* We give a second more explicit example that compares shallow and deep FN networks. Choose *q*<sup>0</sup> = 2 and assume we want to describe a regression function

$$
\mu: \mathbb{R}^2 \to \mathbb{R}, \qquad \qquad \mathbf{x} \mapsto \mu(\mathbf{x}).
$$

If we think of a toolbox of basis functions to build the regression function $\mu$, we may want to choose indicator functions $\boldsymbol{x} \mapsto \chi_A(\boldsymbol{x}) \in \{0, 1\}$ for arbitrary rectangles $A = [x_1^-, x_1^+) \times [x_2^-, x_2^+) \subset \mathbb{R}^2$. We show that we can easily construct such indicator functions $\chi_A(\boldsymbol{x})$ for given rectangles $A \subset \mathbb{R}^2$ with FN networks of depth $d = 2$, but not with shallow FN networks.

For illustrative purposes, we fix a square *<sup>A</sup>* = [−1*/*2*,* <sup>1</sup>*/*2*)* × [−1*/*2*,* <sup>1</sup>*/*2*)* <sup>⊂</sup> <sup>R</sup>2, and we want to construct *χA(x)* with a network of depth *d* = 2. This indicator function *χA* is illustrated in Fig. 7.3.

We choose the step function activation for *φ* and a first FN layer with *q*<sup>1</sup> = 4 neurons

$$\begin{split} \boldsymbol{x} &\mapsto \boldsymbol{z}^{(1)}(\boldsymbol{x}) = \left(1, z_1^{(1)}(\boldsymbol{x}), \ldots, z_4^{(1)}(\boldsymbol{x})\right)^{\top} \\ &= \left(1, \mathbb{1}_{\{x_1 \ge -1/2\}}, \mathbb{1}_{\{x_2 \ge -1/2\}}, \mathbb{1}_{\{x_1 \ge 1/2\}}, \mathbb{1}_{\{x_2 \ge 1/2\}}\right)^{\top} \in \{1\} \times \{0, 1\}^{4}. \end{split}$$

This FN layer has a network parameter, see also (7.9),

$$\left(\mathbf{w}\_1^{(1)}, \dots, \mathbf{w}\_4^{(1)}\right) = \left(\begin{pmatrix} 1/2\\1\\0 \end{pmatrix}, \begin{pmatrix} 1/2\\0\\1 \end{pmatrix}, \begin{pmatrix} -1/2\\1\\0 \end{pmatrix}, \begin{pmatrix} -1/2\\0\\1 \end{pmatrix}\right),\tag{7.12}$$

having dimension *q*1*(q*<sup>0</sup> + 1*)* = 12. For the second FN layer with *q*<sup>2</sup> = 4 neurons we choose the step function activation and

$$\begin{aligned} \mathbf{z} &\mapsto \mathbf{z}^{(2)}(\mathbf{z}) = \left(1, z\_1^{(2)}(\mathbf{z}), \dots, z\_4^{(2)}(\mathbf{z})\right)^\top \\ &= \left(1, \mathbb{1}\_{\{z\_1+z\_2 \ge 3/2\}}, \mathbb{1}\_{\{z\_2+z\_3 \ge 3/2\}}, \mathbb{1}\_{\{z\_1+z\_4 \ge 3/2\}}, \mathbb{1}\_{\{z\_3+z\_4 \ge 3/2\}}\right)^\top. \end{aligned}$$

This FN layer has a network parameter

$$\left(\boldsymbol{w}_1^{(2)}, \ldots, \boldsymbol{w}_4^{(2)}\right) = \left( \begin{pmatrix} -3/2\\1\\1\\0\\0 \end{pmatrix}, \begin{pmatrix} -3/2\\0\\1\\1\\0 \end{pmatrix}, \begin{pmatrix} -3/2\\1\\0\\0\\1 \end{pmatrix}, \begin{pmatrix} -3/2\\0\\0\\1\\1 \end{pmatrix} \right),$$

having dimension $q_2(q_1 + 1) = 20$. For the output layer we choose the identity link $g(x) = x$, and the regression parameter $\boldsymbol{\beta} = (0, 1, -1, -1, 1)^\top \in \mathbb{R}^5$. As a result, we obtain

$$\chi_A(\boldsymbol{x}) = \left\langle \boldsymbol{\beta}, \boldsymbol{z}^{(2:1)}(\boldsymbol{x}) \right\rangle. \tag{7.13}$$

That is, this network of depth *d* = 2, number of neurons *(q*1*, q*2*)* = *(*4*,* 4*)*, step function activation and identity link can perfectly replicate the indicator function for the square *A* = [−1*/*2*,* 1*/*2*)* × [−1*/*2*,* 1*/*2*)*, see Fig. 7.3. This network has *r* = 37 parameters.
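This construction can be verified numerically; the following is a direct transcription of the weights (7.12), the second-layer weights and (7.13) into code (our own illustration, storing the weights $\boldsymbol{w}_j^{(m)}$ as matrix columns, intercept weight in row 0):

```python
import numpy as np

step = lambda x: (x >= 0).astype(float)         # step function activation

W1 = np.array([[ 1/2, 1/2, -1/2, -1/2],         # weights (7.12) as columns
               [ 1.0, 0.0,  1.0,  0.0],
               [ 0.0, 1.0,  0.0,  1.0]])
W2 = np.array([[-3/2, -3/2, -3/2, -3/2],        # second-layer weights as columns
               [ 1.0,  0.0,  1.0,  0.0],
               [ 1.0,  1.0,  0.0,  0.0],
               [ 0.0,  1.0,  0.0,  1.0],
               [ 0.0,  0.0,  1.0,  1.0]])
beta = np.array([0.0, 1.0, -1.0, -1.0, 1.0])    # output parameter

def chi_A(x1, x2):
    z = np.array([1.0, x1, x2])
    z = np.concatenate([[1.0], step(z @ W1)])   # first FN layer, q1 = 4
    z = np.concatenate([[1.0], step(z @ W2)])   # second FN layer, q2 = 4
    return beta @ z                             # identity link, (7.13)
```

A quick check confirms that chi_A returns 1 exactly on the square $[-1/2, 1/2) \times [-1/2, 1/2)$ and 0 outside of it.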

We now consider a shallow FN network with *q*<sup>1</sup> neurons. The resulting regression function with identity link is given by

$$\begin{aligned} \boldsymbol{x} \mapsto \left\langle \boldsymbol{\beta}, \boldsymbol{z}^{(1:1)}(\boldsymbol{x}) \right\rangle &= \left\langle \boldsymbol{\beta}, \left(1, z_1^{(1)}(\boldsymbol{x}), \ldots, z_{q_1}^{(1)}(\boldsymbol{x})\right)^\top \right\rangle \\ &= \left\langle \boldsymbol{\beta}, \left( 1, \mathbb{1}_{\left\{ \langle \boldsymbol{w}_1^{(1)}, \boldsymbol{x} \rangle \ge 0 \right\}}, \ldots, \mathbb{1}_{\left\{ \langle \boldsymbol{w}_{q_1}^{(1)}, \boldsymbol{x} \rangle \ge 0 \right\}} \right)^\top \right\rangle, \end{aligned}$$

where we have used the step function activation $\phi(x) = \mathbb{1}_{\{x \ge 0\}}$. As in (7.9), each of these neurons leads to a partition of the space $\mathbb{R}^2$ by a straight line. Importantly, these straight lines go *across* the *entire feature space*, and, therefore, we cannot exactly construct the indicator function of Fig. 7.3 with a shallow FN network. This can nicely be seen in Fig. 7.4 (lhs), where we consider a shallow FN network with $q_1 = 4$ neurons, weights (7.12), and $\boldsymbol{\beta} = (0, 1/2, 1/2, -1/2, -1/2)^\top$.

However, from the universality theorems we know that shallow FN networks can approximate any compactly supported (continuous) function arbitrarily well for sufficiently large $q_1$. In this example we can introduce additional neurons and let the resulting hyperplanes rotate around the origin. In Fig. 7.4 (middle, rhs) we show this for $q_1 = 8$ and $q_1 = 64$ neurons. We observe that this allows us to approximate a circle, see Fig. 7.4 (rhs), and having circles of different sizes at different locations will allow us to approximate the square $A$ considered above. However, of course, this is a much less efficient way compared to the deep FN network (7.13).

**Fig. 7.4** Shallow FN networks with $q_1 = 4$ (lhs), $q_1 = 8$ (middle) and $q_1 = 64$ (rhs)

Intuitively speaking, shallow FN networks act like additions where we add more and more separating hyperplanes for $q_1 \to \infty$ (*superposition of basis functions*). In contrast to that, going deep allows us not only to use additions but also multiplications (*composition of basis functions*). This is the reason why we can easily construct the indicator function $\chi_A$ in the deep case (where we multiply zeros along the boundary of $A$), but not in the shallow case. $\blacksquare$

## *7.2.3 Gradient Descent Methods*

We describe gradient descent methods in this section. These are used to fit FN networks. Gradient descent algorithms have already been used in Sect. 6.2.4 for fitting LASSO regularized regression models. We will give the full methodological part here, without relying on Sect. 6.2.4.

#### **Plain Vanilla Gradient Descent Algorithm**

Assume we have independent instances *(Yi, xi)*, 1 ≤ *i* ≤ *n*, that follow the same member of the EDF. We choose a regression function

$$
\boldsymbol{x}_i \mapsto \mu(\boldsymbol{x}_i) = \mu_{\vartheta}(\boldsymbol{x}_i) = \mathbb{E}_{\theta(\boldsymbol{x}_i)}[Y_i] = g^{-1}\left\langle \boldsymbol{\beta}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x}_i) \right\rangle,
$$

for a strictly monotone and smooth link function $g$, and a FN network $\boldsymbol{z}^{(d:1)}$ with network parameter $\vartheta \in \mathbb{R}^r$. We assume that the chosen activation function $\phi$ is differentiable. We highlight in the notation that the mean functional $\mu_{\vartheta}(\cdot)$ depends on the network parameter $\vartheta$. The canonical parameter of the response $Y_i$ is given by $\theta(\boldsymbol{x}_i) = h(\mu_{\vartheta}(\boldsymbol{x}_i)) \in \boldsymbol{\Theta}$, where $h = (\kappa')^{-1}$ is the canonical link and $\kappa$ the cumulant function of the chosen member of the EDF. This gives us (under constant dispersion $\varphi$) the log-likelihood function, for given data $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^\top$,

$$\vartheta \mapsto \ell_{\boldsymbol{Y}}(\vartheta) = \sum_{i=1}^n \frac{v_i}{\varphi} \left[ Y_i\, h(\mu_{\vartheta}(\boldsymbol{x}_i)) - \kappa\left( h(\mu_{\vartheta}(\boldsymbol{x}_i)) \right) \right] + a(Y_i; v_i/\varphi).$$

The deviance loss function in this model is given by, see (4.9) and (4.8),

$$\mathfrak{D}(\boldsymbol{Y}, \vartheta) = \frac{2}{n} \sum_{i=1}^{n} \frac{v_{i}}{\varphi} \Big( Y_{i}\, h(Y_{i}) - \kappa\big( h(Y_{i}) \big) - Y_{i}\, h\big( \mu_{\vartheta}(\boldsymbol{x}_{i}) \big) + \kappa\big( h\big( \mu_{\vartheta}(\boldsymbol{x}_{i}) \big) \big) \Big) \ge 0. \tag{7.14}$$

The MLE of $\vartheta$ is found by either maximizing the log-likelihood function or by minimizing the deviance loss function in $\vartheta$. This problem cannot be solved in general because of its complexity. Typically, the deviance loss function is non-convex in $\vartheta$ and it may have many local minima. This is one of the reasons why we are less ambitious here, and why we just try to find a network parameter $\vartheta$ which provides a "small" deviance loss $\mathfrak{D}(\boldsymbol{Y}, \vartheta)$ for the given data $\boldsymbol{Y}$. We discuss this further below; in fact, this is a crucial point in FN network fitting that is related to *in-sample over-fitting* and, therefore, this point will require a broader discussion.

For the moment, we just try to find a network parameter *<sup>ϑ</sup>* that provides a small deviance loss <sup>D</sup>*(Y, <sup>ϑ</sup>)* for the given data *<sup>Y</sup>*. Gradient descent algorithms suggest that we try to step-wise locally improve our current position by changing the network parameter into the direction of the maximal local decrease of the deviance loss function. By assumption, our deviance loss function is differentiable in *ϑ*. This allows us to consider the following first order Taylor expansion in *ϑ*

$$\mathfrak{D}(Y, \widetilde{\vartheta}) = \mathfrak{D}(Y, \vartheta) + \nabla_\vartheta \mathfrak{D}(Y, \vartheta)^\top \big(\widetilde{\vartheta} - \vartheta\big) + o\big(\|\widetilde{\vartheta} - \vartheta\|_2\big) \qquad \text{as } \|\widetilde{\vartheta} - \vartheta\|_2 \to 0.$$

This shows that the locally optimal change $\vartheta \mapsto \widetilde{\vartheta}$ points in the opposite direction of the gradient of the deviance loss function. This motivates the following gradient descent step.

Assume that at algorithmic time $t \in \mathbb{N}$ we have a network parameter $\vartheta^{(t)} \in \mathbb{R}^r$. Choose a suitable *learning rate* $\varrho_{t+1} > 0$, and consider the gradient descent update

$$\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}). \tag{7.15}$$

This gradient descent update gives us the new (smaller) deviance loss at algorithmic time $t+1$

$$\mathfrak{D}(Y, \vartheta^{(t+1)}) = \mathfrak{D}(Y, \vartheta^{(t)}) - \varrho_{t+1} \left\| \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}) \right\|_2^2 + o\left(\varrho_{t+1}\right) \qquad \text{for } \varrho_{t+1} \downarrow 0.$$

Under suitably tempered learning rates $(\varrho_t)_{t\ge1}$, this algorithm converges to a local minimum of the deviance loss function as $t \to \infty$ (provided that we do not get trapped in a saddlepoint).
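The update (7.15) can be sketched on a toy example. The following is our own minimal illustration (not the book's code): a Poisson null model $\mu = \exp(\theta)$ with unit weights and unit dispersion, whose deviance loss gradient is $\tfrac{2}{n}\sum_i(\mu - Y_i)$; gradient descent then converges to the MLE $\theta^* = \log \bar Y$.

```python
import numpy as np

# Toy Poisson null model mu = exp(theta): gradient descent (7.15)
# with a constant learning rate converges to theta* = log(mean(Y)).
rng = np.random.default_rng(0)
Y = rng.poisson(lam=1.5, size=1000).astype(float)

def grad(theta):
    # deviance loss gradient (2/n) * sum_i (mu - Y_i)
    return 2.0 * np.mean(np.exp(theta) - Y)

theta = 0.0   # initial parameter theta^(0)
rho = 0.1     # constant learning rate rho_{t+1}
for t in range(500):
    theta -= rho * grad(theta)   # update (7.15)
```

After 500 steps the iterate agrees with $\log \bar Y$ up to machine precision; in the FN network case the same update is applied to the full parameter vector $\vartheta$.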

*Remarks 7.4* We give a couple of (preliminary) remarks on the gradient descent algorithm (7.15); more explanations, further derivations, and variants of the gradient descent algorithm will be discussed below.


#### **Gradient Calculation via Back-Propagation**

Fast gradient descent algorithms essentially rely on fast gradient calculations of the deviance loss function. Under the EDF setup, the gradient w.r.t. $\vartheta$ is given by

$$\nabla_\vartheta \mathfrak{D}(Y, \vartheta) = \frac{2}{n} \sum_{i=1}^n \frac{v_i}{\varphi}\, \big(\mu_\vartheta(x_i) - Y_i\big)\, h'\big(\mu_\vartheta(x_i)\big)\, \nabla_\vartheta\, \mu_\vartheta(x_i) \tag{7.16}$$

$$= \frac{2}{n} \sum_{i=1}^n \frac{v_i}{\varphi}\, \frac{\mu_\vartheta(x_i) - Y_i}{V\big(\mu_\vartheta(x_i)\big)}\, \frac{1}{g'\big(\mu_\vartheta(x_i)\big)}\, \nabla_\vartheta \big\langle \beta, z^{(d:1)}(x_i) \big\rangle,$$

where the last step uses the variance function $V(\cdot)$ of the chosen EDF; we also refer to (5.9). The main difficulty is the calculation of the gradient

$$\nabla_\vartheta \big\langle \beta, z^{(d:1)}(x) \big\rangle = \nabla_\vartheta \big\langle \beta, \big( z^{(d)} \circ \cdots \circ z^{(1)} \big)(x) \big\rangle,$$

w.r.t. the network parameter $\vartheta = \big(w_1^{(1)}, \ldots, w_{q_d}^{(d)}, \beta\big)^\top \in \mathbb{R}^r$, where each FN layer $z^{(m)}$ involves the weights $W^{(m)} = \big(w_1^{(m)}, \ldots, w_{q_m}^{(m)}\big) \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$. The workhorse for these gradient calculations is the back-propagation method of Rumelhart et al. [324]. Basically, the back-propagation method is a clever reparametrization of the problem so that the gradients can be calculated more easily. We therefore modify the weight matrices $W^{(m)}$ by dropping the first row containing the intercept parameters $w_{0,j}^{(m)}$, $1 \le j \le q_m$. Define for $1 \le m \le d+1$

$$\mathcal{W}\_{(-0)}^{(m)} = \left( w\_{j\_{m-1}, j\_m}^{(m)} \right)\_{1 \le j\_{m-1} \le q\_{m-1}; \ 1 \le j\_m \le q\_m} \in \mathbb{R}^{q\_{m-1} \times q\_m},$$

where $w_{j_{m-1},j_m}^{(m)}$ denotes component $j_{m-1}$ of $w_{j_m}^{(m)}$, and where we set $q_{d+1} = 1$ (output dimension) and $w_{j_d,1}^{(d+1)} = \beta_{j_d}$ for $0 \le j_d \le q_d$.

**Proposition 7.5 (Back-Propagation for the Hyperbolic Tangent Activation)** *Choose a FN network of depth $d \in \mathbb{N}$ and with hyperbolic tangent activation function $\phi(x) = \tanh(x)$.*

• *Define recursively:*

	- *– initialize $q_{d+1} = 1$ and $\delta^{(d+1)}(x) = \mathbf{1} \in \mathbb{R}^{q_{d+1}}$;*
	- *– iterate for $d \ge m \ge 1$*

$$\delta^{(m)}(x) = \operatorname{diag}\left(1 - \big(z_{j_m}^{(m:1)}(x)\big)^2\right)_{1 \le j_m \le q_m} \mathcal{W}_{(-0)}^{(m+1)}\, \delta^{(m+1)}(x) \;\in\; \mathbb{R}^{q_m}.$$

• *We obtain for $0 \le m \le d$*

$$\left(\frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial w_{j_m,j_{m+1}}^{(m+1)}}\right)_{0 \le j_m \le q_m;\ 1 \le j_{m+1} \le q_{m+1}} = z^{(m:1)}(x)\, \delta^{(m+1)}(x)^\top \in \mathbb{R}^{(q_m+1)\times q_{m+1}},$$

*where $z^{(0:1)}(x) = x \in \mathbb{R}^{q_0+1}$ and $w_1^{(d+1)} = \beta \in \mathbb{R}^{q_d+1}$.*

*Proof of Proposition 7.5* Choose $1 \le m \le d$ and define for the neurons $1 \le j_m \le q_m$ the variables

$$\zeta_{j_m}^{(m)}(x) = \left\langle w_{j_m}^{(m)},\, z^{(m-1:1)}(x) \right\rangle.$$

The learned representation in the *m*-th FN layer is obtained by activating these variables

$$z^{(m:1)}(x) = \left(1, \phi\big(\zeta_1^{(m)}(x)\big), \ldots, \phi\big(\zeta_{q_m}^{(m)}(x)\big)\right)^\top \in \mathbb{R}^{q_m+1}.$$

For the output we define

$$\zeta_1^{(d+1)}(x) = \big\langle \beta,\, z^{(d:1)}(x) \big\rangle.$$

The main idea is to calculate the derivatives of $\langle \beta, z^{(d:1)}(x)\rangle$ w.r.t. these new variables $\zeta_{j_m}^{(m)}(x)$.

*Initialization for $m = d+1$.* This provides for $m = d+1$ and $1 \le j_{d+1} \le q_{d+1} = 1$

$$\frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial \zeta_1^{(d+1)}(x)} = 1 = \delta_1^{(d+1)}(x).$$

*Recursion for $m < d+1$.* Next, we calculate the derivatives w.r.t. $\zeta_{j_d}^{(d)}(x)$, for $m = d$ and $1 \le j_d \le q_d$. They are given by (note $q_{d+1} = 1$)

$$\begin{aligned} \frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial \zeta_{j_d}^{(d)}(x)} &= \frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial \zeta_1^{(d+1)}(x)}\, \frac{\partial \zeta_1^{(d+1)}(x)}{\partial \zeta_{j_d}^{(d)}(x)} \\ &= \delta_1^{(d+1)}(x)\, \beta_{j_d}\, \phi'\big(\zeta_{j_d}^{(d)}(x)\big) \\ &= \delta_1^{(d+1)}(x)\, w_{j_d,1}^{(d+1)} \left(1 - \big(z_{j_d}^{(d:1)}(x)\big)^2\right) \;=\; \delta_{j_d}^{(d)}(x), \end{aligned} \tag{7.17}$$

where we have used $w_{j_d,1}^{(d+1)} = \beta_{j_d}$ and, for the hyperbolic tangent activation function, $\phi' = 1 - \phi^2$. Continuing recursively for $d > m \ge 1$ and $1 \le j_m \le q_m$ we obtain

$$\begin{aligned} \frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial \zeta_{j_m}^{(m)}(x)} &= \sum_{j_{m+1}=1}^{q_{m+1}} \frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial \zeta_{j_{m+1}}^{(m+1)}(x)}\, \frac{\partial \zeta_{j_{m+1}}^{(m+1)}(x)}{\partial \zeta_{j_m}^{(m)}(x)} \\ &= \sum_{j_{m+1}=1}^{q_{m+1}} \delta_{j_{m+1}}^{(m+1)}(x)\, w_{j_m,j_{m+1}}^{(m+1)} \left(1 - \big(z_{j_m}^{(m:1)}(x)\big)^2\right) \;=\; \delta_{j_m}^{(m)}(x). \end{aligned}$$

Thus, the vectors $\delta^{(m)}(x) = \big(\delta_1^{(m)}(x), \ldots, \delta_{q_m}^{(m)}(x)\big)^\top$ are calculated recursively for $d \ge m \ge 1$ with initialization $\delta^{(d+1)}(x) = \mathbf{1}$ and the recursion

$$\delta^{(m)}(x) = \operatorname{diag}\left(1 - \big(z_{j_m}^{(m:1)}(x)\big)^2\right)_{1 \le j_m \le q_m} \mathcal{W}_{(-0)}^{(m+1)}\, \delta^{(m+1)}(x) \;\in\; \mathbb{R}^{q_m}.$$

Finally, we need to show how these derivatives are related to the original derivatives in the gradient descent method. We have for $0 \le j_d \le q_d$ and $j_{d+1} = 1$

$$\frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial \beta_{j_d}} = \frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial \zeta_1^{(d+1)}(x)}\, \frac{\partial \zeta_1^{(d+1)}(x)}{\partial \beta_{j_d}} = \delta_{j_{d+1}}^{(d+1)}(x)\, z_{j_d}^{(d:1)}(x).$$

For $1 \le m < d$, and $0 \le j_m \le q_m$ and $1 \le j_{m+1} \le q_{m+1}$ we have

$$\frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial w_{j_m,j_{m+1}}^{(m+1)}} = \frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial \zeta_{j_{m+1}}^{(m+1)}(x)}\, \frac{\partial \zeta_{j_{m+1}}^{(m+1)}(x)}{\partial w_{j_m,j_{m+1}}^{(m+1)}} = \delta_{j_{m+1}}^{(m+1)}(x)\, z_{j_m}^{(m:1)}(x).$$

For $m = 0$, and $0 \le l \le q_0$ and $1 \le j_1 \le q_1$ we have

$$\frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial w_{l,j_1}^{(1)}} = \frac{\partial \langle \beta, z^{(d:1)}(x) \rangle}{\partial \zeta_{j_1}^{(1)}(x)}\, \frac{\partial \zeta_{j_1}^{(1)}(x)}{\partial w_{l,j_1}^{(1)}} = \delta_{j_1}^{(1)}(x)\, x_l.$$

This completes the proof of Proposition 7.5.
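The recursion of Proposition 7.5 can be sketched numerically. The following is our own illustration (shapes and names are assumptions, not the book's code): a depth-2 tanh network, where the back-propagated gradient of one weight in the first layer is checked against a central finite difference.

```python
import numpy as np

# Depth d = 2 tanh network with intercept rows in the weight matrices.
rng = np.random.default_rng(1)
q0, q1, q2 = 3, 4, 3
W1 = rng.normal(size=(q0 + 1, q1))   # W^(1), first row = intercepts w_{0,j}
W2 = rng.normal(size=(q1 + 1, q2))   # W^(2)
beta = rng.normal(size=q2 + 1)       # output weights (beta_0, ..., beta_{q2})
x = rng.normal(size=q0)

def forward(W1, W2, beta, x):
    z0 = np.concatenate(([1.0], x))       # z^(0:1)(x)
    a1 = np.tanh(W1.T @ z0)               # activated first layer
    z1 = np.concatenate(([1.0], a1))      # z^(1:1)(x)
    a2 = np.tanh(W2.T @ z1)
    z2 = np.concatenate(([1.0], a2))      # z^(2:1)(x)
    return z0, a1, z1, a2, z2, beta @ z2  # last entry: <beta, z^(d:1)(x)>

z0, a1, z1, a2, z2, out = forward(W1, W2, beta, x)

# Back-propagation: initialize delta^(d+1) = 1, then recurse over layers,
# dropping the intercept rows (W_(-0)).
delta3 = 1.0
delta2 = (1.0 - a2**2) * (beta[1:] * delta3)     # W^(d+1)_(-0) = beta[1:]
delta1 = (1.0 - a1**2) * (W2[1:, :] @ delta2)
grad_W1 = np.outer(z0, delta1)                   # z^(0:1)(x) delta^(1)(x)^T

# Central finite-difference check of one component of grad_W1.
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[2, 1] += eps
W1m[2, 1] -= eps
fd = (forward(W1p, W2, beta, x)[-1] - forward(W1m, W2, beta, x)[-1]) / (2 * eps)
```

The entry `grad_W1[2, 1]` agrees with the finite difference `fd` up to numerical precision, confirming the $\delta$-recursion of the proposition on this toy architecture.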

*Remark 7.6* Proposition 7.5 gives the back-propagation method for the hyperbolic tangent activation function, which has derivative $\phi' = 1 - \phi^2$. This becomes visible in the definition of $\delta^{(m)}(x)$ where we consider the diagonal matrix

$$\text{diag}\left(1 - \left(z_{j_m}^{(m:1)}(x)\right)^2\right)_{1 \le j_m \le q_m}.$$

For a general differentiable activation function $\phi$ this needs to be replaced by, see (7.17),

$$\text{diag}\left(\phi'\left(\left\langle w_{j_m}^{(m)},\, z^{(m-1:1)}(x)\right\rangle\right)\right)_{1 \le j_m \le q_m}.$$

In the case of the sigmoid activation function this gives us, see also Table 7.1,

$$\text{diag}\left(z_{j_m}^{(m:1)}(x)\left(1 - z_{j_m}^{(m:1)}(x)\right)\right)_{1 \le j_m \le q_m}.$$

Plain vanilla gradient descent algorithm for FN networks

1. Choose an initial network parameter $\vartheta^{(0)} \in \mathbb{R}^r$.

2. Iterate for $t \ge 0$ until a stopping criterion is met:


$$\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}).$$


*Remark 7.7* The initialization $\vartheta^{(0)} \in \mathbb{R}^r$ of the gradient descent algorithm needs some care. A FN network has many symmetries; for instance, we can permute the neurons within a FN layer and we receive the same predictive model. For this reason, the initial network weights $W^{(m)} = \big(w_1^{(m)}, \ldots, w_{q_m}^{(m)}\big) \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$, $1 \le m \le d$, should not be chosen with identical components, because this results in a saddlepoint of the corresponding objective function, and gradient descent will not work. For this reason, these weights are initialized randomly, using either a uniform or a Gaussian distribution. The former is related to the glorot\_uniform initializer in keras,<sup>2</sup> see (16) in Glorot–Bengio [160]. This initializer scales the support of the uniform distribution with the sizes of the FN layers that are connected by the corresponding weights $w_j^{(m)}$.

For the output parameter we usually set as initial value $\beta^{(0)} = \big(\beta_0^{(0)}, 0, \ldots, 0\big)^\top \in \mathbb{R}^{q_d+1}$, where $\beta_0^{(0)}$ is the MLE of the corresponding null model (not considering any features), transformed by the chosen link $g$. This choice implies that the gradient descent algorithm starts in the null model, and any decrease in deviance loss can be seen as an in-sample improvement of the FN network regression structure over the null model.
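The scaling of the uniform support mentioned above can be sketched as follows; this is our own illustration of the Glorot-type rule of (16) in Glorot–Bengio [160] (function name and shapes are assumptions), drawing weights uniformly on $[-L, L]$ with $L = \sqrt{6/(q_{m-1}+q_m)}$.

```python
import numpy as np

# Glorot-type uniform initializer: the support shrinks as the two
# connected layer sizes (fan_in, fan_out) grow.
def glorot_uniform(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W1 = glorot_uniform(8, 20, rng)   # weights connecting layers of sizes 8 and 20
```

The random draw breaks the permutation symmetry of the neurons, so that gradient descent does not start in a saddlepoint.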

#### **Stochastic Gradient Descent**

The gradient in (7.16) has two parts. We have a vector

$$\mathbf{v}(Y) = \left(\frac{v_i}{\varphi}\, \big(\mu_\vartheta(x_i) - Y_i\big)\, \frac{1}{V\big(\mu_\vartheta(x_i)\big)}\, \frac{1}{g'\big(\mu_\vartheta(x_i)\big)}\right)_{1 \le i \le n}^\top \in \mathbb{R}^n,$$

and we have a matrix

$$\mathbf{M} = \left(\nabla_\vartheta \big\langle \beta, z^{(d:1)}(x_1) \big\rangle, \ldots, \nabla_\vartheta \big\langle \beta, z^{(d:1)}(x_n) \big\rangle\right) \in \mathbb{R}^{r \times n}.$$

The gradient of the deviance loss function is obtained by the matrix multiplication

$$\nabla_\vartheta \mathfrak{D}(Y, \vartheta) = \frac{2}{n}\, \mathbf{M}\, \mathbf{v}(Y).$$

Matrix multiplication can be very slow in numerical implementations if the sample size $n$ is large. For this reason, one typically uses the *stochastic gradient descent* (SGD) method, which does not consider the entire data $Y = (Y_1, \ldots, Y_n)^\top$ simultaneously.

<sup>2</sup> For our examples we use the R library keras [77] which is an API to TensorFlow [2].

For the SGD method one chooses a fixed *batch size* $b \in \mathbb{N}$, and one randomly partitions the entire data $Y$ into *(mini-)batches* $Y_1, \ldots, Y_{\lceil n/b \rceil}$ of approximately the same size $b$ (up to cardinality). Each gradient descent update

$$\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}(Y_s, \vartheta^{(t)}),$$

is then only based on the observations $Y_s$ in the corresponding batch $1 \le s \le \lceil n/b \rceil$. Typically, one sequentially visits all batches, and screening each batch once is called an *epoch*. Thus, if we run the SGD algorithm over $K$ epochs on batches of size $b \le n$, then we perform $K \lceil n/b \rceil$ gradient descent steps.

Choosing batches of size $b$ reduces the complexity of the matrix multiplication from $n$ to $b$ and, hence, leads to much faster run times for one gradient descent step. On the other hand, batches should have a minimal size so that the gradient descent updates are not too erratic, i.e., if the batches are too small, the randomness in the data may point too often into a (completely) wrong direction for the optimal gradient descent step. For this reason, optimal batch sizes should be chosen carefully. For instance, if we study a low frequency claims count problem, say, with an expected frequency of $\lambda = 10\%$, we can determine confidence bounds for parameter estimation. This provides an estimate of a minimal batch size $b$ for a reliable parameter estimate.

A few erratic steps in SGD can, however, also be beneficial, as long as there are not too many of them. Sometimes, the algorithm gets trapped in saddlepoints or in flat areas of the objective function (vanishing gradient problem). If this is the case, an erratic step may perturb the algorithm out of its bottleneck. In fact, SGD often performs better than the plain vanilla gradient descent algorithm based on the entire data $Y$, precisely because of these noisy contributions.
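One epoch of SGD can be sketched as follows; this is our own toy illustration (Poisson null model $\mu = \exp(\theta)$, not the book's code): shuffle the $n$ indices, split them into $\lceil n/b \rceil$ batches, and apply update (7.15) batch by batch with a tempered learning rate.

```python
import numpy as np

# Mini-batch SGD for the toy Poisson null model mu = exp(theta).
rng = np.random.default_rng(2)
n, b = 1000, 64
Y = rng.poisson(lam=1.5, size=n).astype(float)

def batch_grad(theta, idx):
    # Poisson deviance gradient evaluated on one batch only
    return 2.0 * np.mean(np.exp(theta) - Y[idx])

theta, rho = 0.0, 0.05
for epoch in range(50):
    for idx in np.array_split(rng.permutation(n), int(np.ceil(n / b))):
        theta -= rho * batch_grad(theta, idx)
    rho *= 0.95   # temper the learning rate across epochs
```

Running $K = 50$ epochs performs $K \lceil n/b \rceil = 50 \cdot 16 = 800$ (noisy) gradient descent steps, and the tempering dampens the erratic late-stage behavior.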

#### **Momentum-Based Gradient Descent Methods**

The gradient descent method only considers a first order Taylor expansion and one is tempted to consider higher order terms to improve the approximation. For instance, Newton's method uses a second order Taylor term by updating

$$\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \left(\nabla_\vartheta^2\, \mathfrak{D}(Y, \vartheta^{(t)})\right)^{-1} \nabla_\vartheta\, \mathfrak{D}(Y, \vartheta^{(t)}). \tag{7.18}$$

In many practical applications this calculation is not feasible, as the Hessian $\nabla_\vartheta^2 \mathfrak{D}(Y, \vartheta^{(t)})$ cannot be calculated in a reasonable amount of time. Another (simple) way of considering the changes in the gradients is the *momentum-based gradient descent method* of Rumelhart et al. [324]. It is inspired by mechanics in physics, and it is achieved by considering the gradients over several iterations of the algorithm (with exponentially decaying weights). Choose a momentum coefficient $\nu \in [0, 1)$ and define the initial speed $\mathbf{v}^{(0)} = 0 \in \mathbb{R}^r$.

Replace the gradient descent update (7.15) by

$$\mathbf{v}^{(t)} \mapsto \mathbf{v}^{(t+1)} = \nu\, \mathbf{v}^{(t)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}), \tag{7.19}$$

$$\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} + \mathbf{v}^{(t+1)}. \tag{7.20}$$

For $\nu = 0$ we have the plain vanilla gradient descent method; for $\nu > 0$ we also memorize the previous gradients (with exponentially decaying weights). Typically, this leads to better convergence properties.
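The momentum updates (7.19)–(7.20) can be sketched on a toy quadratic loss $\mathfrak{D}(\vartheta) = \|\vartheta\|_2^2/2$ (our own illustration; the loss and the values $\varrho = 0.1$, $\nu = 0.8$ are assumptions):

```python
import numpy as np

# Momentum-based gradient descent (7.19)-(7.20) on D(theta) = ||theta||^2/2;
# the iterates spiral into the unique minimum at the origin.
def grad(theta):
    return theta   # gradient of ||theta||^2 / 2

theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
rho, nu = 0.1, 0.8
for t in range(300):
    v = nu * v - rho * grad(theta)   # speed update (7.19)
    theta = theta + v                # parameter update (7.20)
```

The accumulated speed $\mathbf{v}^{(t)}$ averages past gradients with exponentially decaying weights, which smooths the trajectory compared to plain gradient descent.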

Nesterov [284] has noticed that for convex functions the gradient descent updates may have a zig-zag behavior. Therefore, he proposed the so-called Nesterov-accelerated version

$$\mathbf{v}^{(t)} \mapsto \mathbf{v}^{(t+1)} = \nu\, \mathbf{v}^{(t)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}\big(Y, \vartheta^{(t)} + \nu\, \mathbf{v}^{(t)}\big),$$

$$\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} + \mathbf{v}^{(t+1)}. \tag{7.21}$$

Thus, the calculation of the momentum $\mathbf{v}^{(t+1)}$ uses a look-ahead $\vartheta^{(t)} + \nu\, \mathbf{v}^{(t)}$ in the gradient calculation (anticipating part of the next step). Under the reparametrization $\widetilde{\vartheta}^{(t)} = \vartheta^{(t)} + \nu\, \mathbf{v}^{(t)}$, this provides the following equivalent versions of the update (7.21),

$$\begin{aligned} \vartheta^{(t+1)} &= \vartheta^{(t)} + \left(\nu\, \mathbf{v}^{(t)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}\big(Y, \vartheta^{(t)} + \nu\, \mathbf{v}^{(t)}\big)\right) \\ &= \vartheta^{(t)} + \left(\nu\, \mathbf{v}^{(t)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}\big(Y, \widetilde{\vartheta}^{(t)}\big)\right) \\ &= \widetilde{\vartheta}^{(t)} + \left(\nu\, \mathbf{v}^{(t+1)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}\big(Y, \widetilde{\vartheta}^{(t)}\big)\right) - \nu\, \mathbf{v}^{(t+1)}. \end{aligned} \tag{7.22}$$

Using the last line of (7.22), the Nesterov-accelerated update can also be written as

$$\begin{aligned} \mathbf{v}^{(t)} \mapsto \mathbf{v}^{(t+1)} &= \nu\, \mathbf{v}^{(t)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}\big(Y, \widetilde{\vartheta}^{(t)}\big), \\ \widetilde{\vartheta}^{(t)} \mapsto \widetilde{\vartheta}^{(t+1)} &= \widetilde{\vartheta}^{(t)} + \left(\nu\, \mathbf{v}^{(t+1)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}\big(Y, \widetilde{\vartheta}^{(t)}\big)\right). \end{aligned} \tag{7.23}$$

Compared to (7.19)–(7.20), we just shift the index by 1 in the momentum $\mathbf{v}^{(t)}$ in the round brackets of (7.23). The way the Nesterov acceleration is typically formulated is yet another equivalent version, namely, only in terms of $\vartheta^{(t)}$ and $\widetilde{\vartheta}^{(t)}$. From the second line of (7.22) and (7.21) we have the updates

$$\vartheta^{(t+1)} = \widetilde{\vartheta}^{(t)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}\big(Y, \widetilde{\vartheta}^{(t)}\big),$$

$$\widetilde{\vartheta}^{(t+1)} = \vartheta^{(t+1)} + \nu \left( \vartheta^{(t+1)} - \vartheta^{(t)} \right). \tag{7.24}$$

Typically, one chooses the momentum coefficient $\nu$ in (7.24) time-dependent by setting $\nu_t = t/(t+3)$.
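Formulation (7.24) with the time-dependent momentum $\nu_t = t/(t+3)$ can be sketched on the same toy quadratic loss as above (our own illustration; loss and learning rate are assumptions):

```python
import numpy as np

# Nesterov-accelerated updates (7.24) with nu_t = t/(t+3) on the toy
# quadratic loss D(theta) = ||theta||^2 / 2.
def grad(theta):
    return theta

theta = np.array([5.0, -3.0])
theta_tilde = theta.copy()   # look-ahead point theta-tilde^(0) = theta^(0)
rho = 0.1
for t in range(500):
    nu = t / (t + 3.0)
    theta_new = theta_tilde - rho * grad(theta_tilde)          # first line of (7.24)
    theta_tilde = theta_new + nu * (theta_new - theta)         # second line of (7.24)
    theta = theta_new
```

Note that the gradient is always evaluated at the look-ahead point $\widetilde{\vartheta}^{(t)}$, not at $\vartheta^{(t)}$, which is what dampens the zig-zag behavior.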

In our applications we will use the R interface to the keras library [77]. This library has a couple of standard momentum-based gradient descent methods implemented, which use pre-defined learning rates and momentum coefficients. In our analysis we mainly rely on the variants rmsprop and the Nesterov-accelerated version of adam, called nadam. Therefore, we briefly describe these three variants; for more information we refer to Sections 8.3 and 8.5 in Goodfellow et al. [166].

#### **Predefined Gradient Descent Methods**

• rmsprop stands for 'root mean square propagation', and its origin can be found in a lecture of Hinton et al. [187]. Denote by $\odot$ the Hadamard product that computes the component-wise products of two matrices. Choose a weight $\alpha \in (0, 1)$ and calculate the accumulated squared gradients, setting $\mathbf{r}^{(0)} = 0 \in \mathbb{R}^r$,

$$\mathbf{r}^{(t)} \mapsto \mathbf{r}^{(t+1)} = \alpha\, \mathbf{r}^{(t)} + (1 - \alpha)\left(\nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}) \odot \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)})\right) \in \mathbb{R}^r.$$

The sequence $(\mathbf{r}^{(t)})_{t \ge 1}$ memorizes the (squared) magnitudes of the components of the gradients $\nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)})$, $t \ge 1$. This is done individually for each component because we may have directional differences in magnitudes (and momentum). In contrast to (7.19), $\mathbf{r}^{(t)}$ does not model the speed, but rather an inverse weight. This then motivates the gradient descent update

$$\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \frac{\varrho}{\sqrt{\varepsilon + \mathbf{r}^{(t+1)}}} \odot \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}),$$

where the square-root is taken component-wise, for a global decay rate $\varrho > 0$, and for a small positive constant $\varepsilon > 0$ to ensure that everything is well-defined.

• adam stands for 'adaptive moment' estimation, and it has been proposed by Kingma–Ba [216]. The momentum is determined by the first two moments in adam, namely, we set $\mathbf{v}^{(0)} = \mathbf{r}^{(0)} = 0 \in \mathbb{R}^r$ and we consider

$$\mathbf{v}^{(t)} \mapsto \mathbf{v}^{(t+1)} = \nu\, \mathbf{v}^{(t)} + (1 - \nu)\, \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}), \tag{7.25}$$

$$\mathbf{r}^{(t)} \mapsto \mathbf{r}^{(t+1)} = \alpha\, \mathbf{r}^{(t)} + (1 - \alpha)\left(\nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}) \odot \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)})\right), \tag{7.26}$$

for given weights $\nu, \alpha \in (0, 1)$. Similar to Bayesian credibility theory, $\mathbf{v}^{(t)}$ and $\mathbf{r}^{(t)}$ are biased because these two processes have been initialized in zero. Therefore, they are rescaled by $1/(1 - \nu^t)$ and $1/(1 - \alpha^t)$, respectively. This gives us the gradient descent update

$$\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \frac{\varrho}{\varepsilon + \sqrt{\dfrac{\mathbf{r}^{(t+1)}}{1 - \alpha^t}}} \odot \frac{\mathbf{v}^{(t+1)}}{1 - \nu^t},$$

where the square-root is taken component-wise, for a global decay rate $\varrho > 0$, and for a small positive constant $\varepsilon > 0$ to ensure that everything is well-defined.

• nadam is the Nesterov-accelerated [284] version of adam. Similarly as when going from (7.19)–(7.20) to (7.23), the acceleration is obtained by a shift of 1 in the velocity parameter; thus, we consider the Nesterov-accelerated adam update

$$\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \frac{\varrho}{\varepsilon + \sqrt{\dfrac{\mathbf{r}^{(t+1)}}{1 - \alpha^t}}} \odot \frac{\nu\, \mathbf{v}^{(t+1)} + (1 - \nu)\, \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)})}{1 - \nu^t},$$

using (7.25) and (7.26).
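A single adam update with the bias corrections above can be sketched as follows; this is our own illustration (the $\varepsilon$ placement follows the display above, which differs slightly from some library implementations, and the default values $\nu = 0.9$, $\alpha = 0.999$ are common choices, not prescribed by the text):

```python
import numpy as np

# One adam step: first/second moments (7.25)-(7.26) with bias corrections.
def adam_step(theta, v, r, g, t, rho=0.01, nu=0.9, alpha=0.999, eps=1e-8):
    v = nu * v + (1.0 - nu) * g              # first moment, (7.25)
    r = alpha * r + (1.0 - alpha) * g * g    # second moment, (7.26)
    v_hat = v / (1.0 - nu**t)                # bias corrections (credibility-type
    r_hat = r / (1.0 - alpha**t)             # rescaling for the zero start)
    theta = theta - rho / (eps + np.sqrt(r_hat)) * v_hat
    return theta, v, r

# Toy run on the gradient g = theta of D(theta) = ||theta||^2 / 2.
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
r = np.zeros_like(theta)
for t in range(1, 2001):
    theta, v, r = adam_step(theta, v, r, theta, t)
```

The nadam variant replaces the numerator $\mathbf{v}^{(t+1)}$ by $\nu\, \mathbf{v}^{(t+1)} + (1-\nu)\, \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)})$, i.e., by a one-step look-ahead of the momentum.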

#### **Maximum Likelihood Estimation and Over-fitting**

As explained above, we model the mean of the datum *(Y, x)* by a deep FN network

$$x \mapsto \mu(x) = \mu_\vartheta(x) = \mathbb{E}_{\theta(x)}[Y] = g^{-1}\big\langle \beta,\, z^{(d:1)}(x) \big\rangle,$$

for a network parameter $\vartheta \in \mathbb{R}^r$. The MLE of this network parameter requires solving, for given data $Y$,

$$\widehat{\vartheta}^{\,\mathrm{MLE}} = \underset{\vartheta}{\arg\min}\; \mathfrak{D}(Y, \vartheta).$$

In Fig. 7.5 we give a schematic figure of a loss surface $\vartheta \mapsto \mathfrak{D}(Y, \vartheta)$ for a (low-dimensional) example $\vartheta \in \mathbb{R}^2$. The two plots show the same loss surface from two different angles. This loss surface has three (local) minima (red color), and the smallest one (global minimum) gives the MLE $\widehat{\vartheta}^{\,\mathrm{MLE}}$.

In general, this global minimum cannot be found for more complex network architectures because the loss surface typically has a complicated structure for high-dimensional parameter spaces. Is this a problem in FN network fitting? Not really! We are going to explain why. The universality theorems in Sect. 7.2.2 state that more complex FN networks have an excellent approximation capacity. If we translate this to our statistical modeling problem, it means that the observations $Y$ can be approximated arbitrarily well by sufficiently complex FN networks. In particular, for a given complex network architecture, the MLE $\widehat{\vartheta}^{\,\mathrm{MLE}}$ will provide the optimal fit of this architecture to the data $Y$, and, as a result, this network does not only reflect the systematic effects in the data but also the noisy part. This behavior is called *(in-sample) over-fitting* to the learning data $\mathcal{L}$. It implies that such statistical models typically have a poor generalization to unseen (out-of-sample) test data $\mathcal{T}$; this is illustrated by the red color in Fig. 7.6. For this reason, in general, we are not interested in finding the MLE $\widehat{\vartheta}^{\,\mathrm{MLE}}$ of $\vartheta$ in FN network regression modeling, but we would like to find a parameter estimate $\widehat{\vartheta}$ that (only) extracts the systematic effects from the learning data $\mathcal{L}$. This is illustrated by the different colors in Figs. 7.5

**Fig. 7.5** Schematic figure of a loss surface $\vartheta \mapsto \mathfrak{D}(Y, \vartheta)$ from two different angles for a two-dimensional parameter $\vartheta \in \mathbb{R}^2$

and 7.6, where we assume: (a) red color provides models with a poor generalization power due to over-fitting, (b) blue color provides models with a poor generalization power, too, because these parametrizations do not explain the systematic effects in the data at all (called under-fitting), and (c) green color gives good parametrizations that explain the systematic effects in the data and generalize well to unseen data. Thus, the aim is to find parametrizations that are in the green area of Fig. 7.5. This green area emphasizes that we lose the notion of uniqueness because there are infinitely many models in the green area that have a comparable generalization


power. Next we explain how we can exploit the gradient descent algorithm to make it useful for finding parametrizations in the green area.

*Remark 7.8* The loss surface considerations in Fig. 7.5 are based on a fixed network architecture. Recent research promotes the so-called Graph HyperNetwork (GHN), a (hyper-)network that tries to find the optimal network architecture and its parametrization by an additional network; we refer to Zhang et al. [402] and Knyazev et al. [219].

#### **Regularization Through Early Stopping**

As stated above, if we run the gradient descent algorithm with properly tempered learning rates, it will converge to a local minimum of the loss function, which means that the resulting FN network over-fits to the learning data. For this reason we need to *early stop* the gradient descent algorithm beforehand. Coming back to Fig. 7.5: typically, we start the gradient descent algorithm somewhere in the blue area of the loss surface (provided that the red area is a sparse set on the loss surface). Visually speaking, the gradient descent algorithm then walks down the valley (green, yellow and red areas) by exploiting locally optimal steps. Since at the early stage of the algorithm the systematic effects play a dominant role over the noisy part, the gradient descent algorithm learns these systematic effects first (blue area in Fig. 7.5). When the algorithm arrives at the green area, the noisy part in the data starts to increasingly influence the model calibration (gradient descent steps), and, hence, at this stage the algorithm should be stopped, and the learned parameter should be selected for predictive modeling. This early stopping is an implicit way of regularization, because it stops the parameter fitting before the parameters start to learn very individual features of the (noisy) data (and take extreme values).

This early stopping point is determined by an out-of-sample analysis. This requires the learning data $\mathcal{L}$ to be further split into *training data* $\mathcal{U}$ and *validation data* $\mathcal{V}$. The training data $\mathcal{U}$ is used for gradient descent parameter learning, and the validation data $\mathcal{V}$ is used for tracking the over-fitting by an instantaneous (out-of-sample) validation analysis. This partition is illustrated in Fig. 7.7, which also highlights that the validation data $\mathcal{V}$ is disjoint from the test data $\mathcal{T}$, the latter only being used in the final step for comparing different statistical models (e.g., a GLM vs. a FN network). That is, model comparison is done in a proper out-of-sample manner on $\mathcal{T}$, and each of these models is only fit on $\mathcal{U}$ and $\mathcal{V}$. Thus, for FN network fitting with early stopping we need a reasonable amount of data that can be split into three sufficiently large data sets, so that each is suitable for its purpose.

For early stopping we partition the learning data *L* into training data *U* and validation data *V*. The plain vanilla gradient descent algorithm can then be changed as follows.

**Fig. 7.7** Partition of entire data *D* (lhs) into learning data *L* and test data *T* (middle), and into training data *U*, validation data *V* and test data *T* (rhs)

Plain vanilla gradient descent algorithm with early stopping

	- (a) Calculate the gradient $\nabla_\vartheta \mathfrak{D}(\mathcal{U}, \vartheta)$ in the network parameter $\vartheta = \vartheta^{(t)}$ on the training data $\mathcal{U}$, using (7.16) and the back-propagation method of Proposition 7.5 (for the hyperbolic tangent activation function).
	- (b) Make the gradient descent step for a suitable learning rate $\varrho_{t+1} > 0$

$$\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \varrho_{t+1}\, \nabla_\vartheta \mathfrak{D}(\mathcal{U}, \vartheta^{(t)}).$$


$$\mathfrak{D}(\mathcal{V}, \mathfrak{d}^{(t)}) \, \, \, \, \mathfrak{D}(\mathcal{V}, \mathfrak{d}^{(t-1)}), \,\, \,\, \tag{7.27}$$

and return the learned parameter (estimate) *<sup>ϑ</sup>* <sup>=</sup> *<sup>ϑ</sup>(t*−1*)* .

In applications we use the SGD algorithm, which can also take erratic steps because not all random (mini-)batches are necessarily typical representations of the data. In such cases we should use more sophisticated stopping criteria than (7.27), for instance, early stopping only if the validation loss has increased five times in a row.
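As an illustration, such a patience-based stopping rule can be sketched in a few lines (plain Python with a hypothetical validation-loss trajectory; not the book's implementation, which uses a keras callback):

```python
# Sketch of early stopping with a patience rule: track the epoch with minimal
# validation loss and stop once the loss has increased `patience` times in a row.
def early_stop(val_losses, patience=5):
    best_epoch, increases = 0, 0
    for t in range(1, len(val_losses)):
        if val_losses[t] >= val_losses[t - 1]:
            increases += 1
            if increases >= patience:
                break           # patience exhausted: stop the training loop
        else:
            increases = 0       # loss decreased again: reset the counter
        if val_losses[t] < val_losses[best_epoch]:
            best_epoch = t      # remember the best parameter snapshot
    return best_epoch

# Noisy validation-loss trajectory: decreasing at first, then drifting upwards.
losses = [1.00, 0.80, 0.70, 0.66, 0.65, 0.66, 0.64, 0.67, 0.68, 0.69, 0.70, 0.71]
stop_at = early_stop(losses, patience=5)   # epoch whose parameter is retrieved
```

The returned epoch is the one whose parameter snapshot is used for prediction, in line with returning $\vartheta^{(t-1)}$ above.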

Figure 7.8 provides an example of the application of the SGD algorithm on training data *U* and validation data *V*. The training loss is in blue color and the validation loss in green color. We observe that the validation loss has its minimum after 52 epochs (orange vertical line), and hence the fitting algorithm should be stopped at this point. We give a couple of remarks concerning Fig. 7.8:

• The learning data *L* exactly corresponds to the claims frequency data of Sect. 5.2.4, see also Table 5.2. We take 10% as validation data, which gives |*U*| = 549 185 and |*V*| = 61 021. For the SGD algorithm we use batches of size 10 000, which implies that one epoch corresponds to ⌊549 185*/*10 000⌋ = 54 gradient descent steps. For batches of size 10 000 we expect an approximate estimation precision on an average frequency of *λ*¯ = 7*.*36% in the Poisson model of

$$\left[\bar{\lambda} - 2\sqrt{\frac{\bar{\lambda}}{10'000\bar{v}}}, \; \bar{\lambda} + 2\sqrt{\frac{\bar{\lambda}}{10'000\bar{v}}}\right] = [6.62\%, 8.11\%],$$

with an average exposure *v*¯ = 0*.*5283 on our learning data; we also refer to Example 3.22.
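This two-standard-deviation interval can be recomputed directly from the quantities given in the text (a small Python check; the numbers *λ*¯ = 7.36% and *v*¯ = 0.5283 are taken from above):

```python
import math

# Two-standard-deviation interval for the average frequency estimated from a
# batch of 10'000 policies in the Poisson model, cf. Example 3.22.
lam_bar = 0.0736     # average frequency (from the text)
v_bar = 0.5283       # average exposure (from the text)
batch = 10_000

half_width = 2 * math.sqrt(lam_bar / (batch * v_bar))
lower, upper = lam_bar - half_width, lam_bar + half_width
```

Up to rounding, this reproduces the interval [6.62%, 8.11%] stated above.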


We close this section with remarks.

## *Remarks 7.9*


## **7.3 Feed-Forward Neural Network Examples**

## *7.3.1 Feature Pre-processing*

Similarly to GLMs, we also need to pre-process the feature components in FN network regression modeling. The former Sect. 5.2.2 for GLMs has been called 'feature engineering' because we need to bring the feature components into an appropriate functional form w.r.t. the given regression task. The present section is called 'feature pre-processing' because we do not need to engineer the features for FN networks. We only need to bring them into a suitable (tabular) form to enter the network, and the network will then do an automated feature engineering through representation learning.

#### **Categorical Feature Components: One-Hot Encoding**

The categorical features have been treated by dummy coding within GLMs. Dummy coding provides full rank design matrices. For FN network regression modeling the full rank property is not important because, anyway, we neither have a single (local) minimum of the objective function, nor do we want to calculate the MLE of the network parameter. Typically, in FN network regression modeling one uses one-hot encoding for the categorical variables, which encodes every level by a unit vector. Assume the raw feature component $\tilde{x}\_j$ is a categorical variable taking *K* different levels {*a*1*,...,aK*}. One-hot encoding is obtained by the embedding map

$$\widetilde{\mathbf{x}}\_{j} \mapsto \mathbf{x}\_{j} = (\mathbb{1}\_{\{\widetilde{\mathbf{x}}\_{j} = a\_{1}\}}, \dots, \mathbb{1}\_{\{\widetilde{\mathbf{x}}\_{j} = a\_{K}\}})^{\top} \in \{0, 1\}^{K}. \tag{7.28}$$
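In code, the one-hot encoding map (7.28) amounts to a simple lookup (a minimal Python sketch with illustrative level names, not the data set's actual levels):

```python
# One-hot encoding (7.28): map a level to the corresponding unit vector
# in {0,1}^K, where K is the number of levels of the categorical feature.
def one_hot(level, levels):
    return [1 if level == a else 0 for a in levels]

levels = ["a1", "a2", "a3", "a4"]   # K = 4 levels (illustrative)
encoded = one_hot("a3", levels)     # third unit vector
```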

**Table 7.2** (excerpt) One-hot encoding of car colors: the levels *a*<sup>9</sup> = green, *a*<sup>10</sup> = beige and *a*<sup>11</sup> = brown are mapped to the 9th, 10th and 11th unit vectors in {0*,* 1}<sup>11</sup>

An explicit example is given in Table 7.2 which should be compared to Table 5.1.

#### **Continuous Feature Components**

The continuous feature components do not need any pre-processing but they can directly enter the FN network which will take care of representation learning. However, an efficient use of gradient descent methods typically requires that all feature components live on a similar scale and that they are roughly uniformly spread across their domains. This makes gradient descent steps more efficient in exploiting the relevant directions.

One possibility is to use the MinMaxScaler. Let $x\_j^{-}$ and $x\_j^{+}$ be the minimal and maximal possible feature values of the continuous feature component $x\_j$, i.e., $x\_j \in [x\_j^{-}, x\_j^{+}]$. We transform this continuous feature component to unit scale for all data $1 \le i \le n$ by

$$x\_{i,j} \mapsto x\_{i,j}^{\text{MM}} = 2\, \frac{x\_{i,j} - x\_j^{-}}{x\_j^{+} - x\_j^{-}} - 1 \in [-1, 1]. \tag{7.29}$$

The resulting feature values $(x\_{i,j}^{\text{MM}})\_{1 \le i \le n}$ should be roughly uniformly spread across the interval [−1*,* 1]. If this is not the case, for instance, because we have outliers in the feature values, we may first transform them non-linearly to obtain more uniformly spread values. For example, we consider the Density of the car frequency example on the log scale.

An alternative to the MinMaxScaler is to consider normalization with the empirical mean $\bar{x}\_j$ and the empirical standard deviation $\hat{\sigma}\_j$ over all data $x\_{i,j}$. That is,

$$x\_{i,j} \mapsto x\_{i,j}^{\text{sd}} = \frac{x\_{i,j} - \bar{x}\_{j}}{\hat{\sigma}\_{j}}.\tag{7.30}$$

It depends on the application whether the MinMaxScaler or normalization with the empirical mean and standard deviation works better. In applications it is important that we use exactly the same values for the normalization of the training data *U*, validation data *V* and test data *T*, to make the same network applicable to all these data sets. For notational convenience we will drop the upper index in $x\_{i,j}^{\text{MM}}$ or $x\_{i,j}^{\text{sd}}$, respectively, and we assume throughout that all feature components are appropriately pre-processed.
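Both scalers can be sketched as follows; note that the scaling constants are fitted on the training data only and then reused for the other data sets, as just discussed (plain Python, toy numbers):

```python
# MinMaxScaler (7.29) and normalization (7.30): fit the constants on the
# training data, then apply the *same* transform to validation/test data.
def fit_minmax(train):
    lo, hi = min(train), max(train)
    return lambda x: 2 * (x - lo) / (hi - lo) - 1

def fit_standardize(train):
    n = len(train)
    mean = sum(train) / n
    sd = (sum((x - mean) ** 2 for x in train) / n) ** 0.5
    return lambda x: (x - mean) / sd

train = [18.0, 25.0, 40.0, 55.0, 90.0]   # toy values, e.g. driver ages
mm = fit_minmax(train)
test_value = mm(55.0)                    # same transform applied to unseen data

std = fit_standardize(train)
standardized = [std(x) for x in train]   # empirical mean 0, standard deviation 1
```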

## *7.3.2 Lab: Poisson FN Network for Car Insurance Frequencies*

We present a first FN network example applied to the French MTPL claim frequency data studied in Sect. 5.2.4. We assume that the claim counts *Ni* are independent and Poisson distributed with claim count density (5.26), where we replace the GLM regression function $\boldsymbol{x} \mapsto \exp \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle$ by the FN network regression function

$$\boldsymbol{x} \in \mathcal{X} \mapsto \mu(\boldsymbol{x}) = \exp \left\langle \boldsymbol{\beta}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x}) \right\rangle.$$

We use a FN network of depth *d* = 3 with numbers of neurons *(q*1*, q*2*, q*3*)* = *(*20*,* 15*,* 10*)* and the hyperbolic tangent activation function. We pre-process the categorical variables VehBrand and Region by one-hot encoding, providing input dimensions 11 and 22, respectively. The binary variable VehGas is encoded as 0–1. Because of scarcity of data we right-censor the continuous variables VehAge at 20, DrivAge at 90 and BonusMalus at 150, and we transform Density to the log scale. We then apply to each of these (modified) continuous variables Area, VehPower, VehAge, DrivAge, BonusMalus and log*(*Density*)* a MinMaxScaler. This provides us with an input dimension *q*<sup>0</sup> = 11 + 22 + 1 + 6 = 40. The resulting FN network is illustrated in Fig. 7.2, with the one-hot encoded variables VehBrand in orange color and Region in magenta color. It has a network parameter $\vartheta \in \mathbb{R}^r$ of dimension $r = 1\,306$.
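The parameter count *r* = 1 306 can be verified by summing (input width + 1) × output width over the layers (a small Python check):

```python
# Recompute the network parameter dimension r = 1'306: input dimension q0 = 40,
# FN layers of widths (20, 15, 10), and a single output neuron; each layer
# contributes (input_width + 1) * output_width parameters (weights + intercepts).
widths = [40, 20, 15, 10, 1]
r = sum((q_in + 1) * q_out for q_in, q_out in zip(widths[:-1], widths[1:]))
```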

This network is implemented in R using the library keras [77]. The code is provided in Listing 7.1 and the resulting network architecture is summarized in Listing 7.2. This network is now fitted to the data: we use a batch size of 10'000, the nadam version of SGD, 10% of the learning data *L* as validation data *V*, and the remaining 90% as training data *U*. We then run the corresponding

**Listing 7.1** FN network of depth *d* = 3 using the R library keras [77]

```
1 library(keras)
2 #
3 Design = layer_input(shape = c(40), dtype = 'float32', name = 'Design')
4 Vol = layer_input(shape = c(1), dtype = 'float32', name = 'Vol')
5 #
6 Network = Design %>%
7 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
8 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
9 layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
10 layer_dense(units=1, activation='exponential', name='Network',
11 weights=list(array(0, dim=c(10,1)), array(log(lambda0), dim=c(1))))
12 #
13 Response = list(Network, Vol) %>% layer_multiply(name='Multiply')
14 #
15 model = keras_model(inputs = c(Design, Vol), outputs = c(Response))
16 #
17 summary(model)
```



**Listing 7.3** Fitting a FN network using the R library keras [77]

```
1 path0 <- "path_for_callback"
2 CBs <- callback_model_checkpoint(path0, monitor = "val_loss", verbose = 0,
3 save_best_only = TRUE, save_weights_only = TRUE)
4 #
5 model %>% compile(loss = 'poisson', optimizer = 'nadam')
6 fit <- model %>% fit(list(Xlearn, Vlearn), Ylearn, validation_split=0.1,
7 batch_size=10000, epochs=1000, verbose=0, callbacks=CBs)
8 #
9 load_model_weights_hdf5(model, path0)
```


**Table 7.3** Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10−2) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5 and the FN network model (with one-hot encoding of the categorical variables)

SGD algorithm and we retrieve the network with the lowest validation loss using a callback. This is illustrated in Listing 7.3. The fitting performance on the training and validation data is illustrated in Fig. 7.8, and we retrieve the network calibration after the 52nd epoch because it has the lowest validation loss. The results are presented in Table 7.3.

From the results of Table 7.3 we conclude that the FN network outperforms model Poisson GLM3 (out-of-sample) since it has a (clearly) lower out-of-sample deviance loss on the test data *T*. This may indicate that there is an interaction between the feature components that has not been captured in the GLM. The run time of 51s corresponds to the run time until the minimal validation loss is reached; of course, in practice we need to continue beyond this minimal validation loss to ensure that we have really found the minimum. Finally, and importantly, we observe that this early stopped FN network calibration does not meet the balance property because the resulting average frequency of this fitted model of 6*.*96% is below the empirical frequency of 7*.*36%. This is a major deficiency of this FN network fitting approach, and it is discussed further in Sect. 7.4.2, below.

We can perform a detailed analysis about different batch sizes, variants of SGD methods, run times, etc. We briefly summarize our findings; this summary is also based on the findings in Ferrario et al. [127]. We have fitted this model on batches of sizes 2'000, 5'000, 10'000 and 20'000, and it seems that a batch size around 5'000 has the best performance, both concerning out-of-sample performance and run time to reach the minimal validation loss. Comparing the different optimizers rmsprop, adam and nadam, a clear preference can be given to nadam: the resulting prediction accuracy is similar in all three optimizers (they all reach the green area in Fig. 7.5), but nadam reaches this optimal point in half of the time compared to rmsprop and adam.

We conclude by highlighting that different initial points $\vartheta^{(0)}$ of the SGD algorithm will give different network calibrations, and the differences can be considerable. This is discussed in Sect. 7.4.4, below. Moreover, we could explore different network architectures, simpler ones, more complex ones, different activation functions, etc. The results of these different architectures will not be essentially different from our results, as long as the networks are above a minimal complexity bound. This closes our first example on FN networks; this example is the benchmark for the refined versions that are presented in the subsequent sections.

## **7.4 Special Features in Networks**

## *7.4.1 Special Purpose Layers*

So far, our networks consist of stacked FN layers, and information is passed in a directed acyclic feed-forward path from one FN layer to the next. In this section we discuss special purpose layers that perform a specific task in a FN network. These include *embedding layers*, *drop-out layers* and *normalization layers*. These modules should be seen as add-ons to the FN layers. Besides these add-ons, there are also *recurrent layers* and *convolutional layers*. These two types of layers are going to be discussed in chapters of their own, below, because their importance goes beyond just being add-ons to the FN layers.

#### **Embedding Layers for Categorical Feature Components**

The categorical feature components have been treated either by dummy coding or by one-hot encoding, and this has resulted in numerous network parameters in the first FN layer, see Fig. 7.2. Natural language processing (NLP) treats categorical feature components differently, namely, it *embeds* categorical feature components (or words in NLP) into a Euclidean space R*<sup>b</sup>* of a *small* dimension *b*. This small dimension *b* is a hyper-parameter that has to be selected by the modeler, and which, typically, is selected much smaller than the total number of levels of the categorical feature. This embedding technique is quite common in NLP, see Bengio et al. [27– 29], but it goes beyond NLP applications, see Guo–Berkhahn [176], and it has been introduced to the actuarial community by Richman [312, 313] and the tutorial of Schelldorfer–Wüthrich [329].

We assume the same set-up as in dummy coding (5.21) and in one-hot encoding (7.28), namely, that we have a raw categorical feature component $\tilde{x}\_j$ taking *K* different levels {*a*1*,...,aK*}. In one-hot encoding these *K* levels are mapped to the *K* unit vectors of the Euclidean space $\mathbb{R}^K$, and consequently all levels have the same mutual Euclidean distance. This does not seem to be the best way of comparing the different levels because in our regression analysis we would like to identify the levels that are more similar w.r.t. the regression task and, thus, these should cluster. For an *embedding layer* one chooses a Euclidean space $\mathbb{R}^b$ of a dimension $b < K$, typically being (much) smaller than *K*. One then considers the *embedding map*

$$\mathfrak{e}: \{a\_1, \ldots, a\_K\} \to \mathbb{R}^b, \qquad a\_k \mapsto \mathfrak{e}(a\_k) \stackrel{\text{def.}}{=} \mathfrak{e}^{(k)}.\tag{7.31}$$

That is, every level $a\_k$ receives a vector representation $e^{(k)} \in \mathbb{R}^b$ which is lower dimensional than its one-hot encoded counterpart in $\mathbb{R}^K$. Proximity of the representations $e^{(k)}$ and $e^{(k')}$ in $\mathbb{R}^b$, i.e., of two levels $a\_k$ and $a\_{k'}$, should be related to similarity w.r.t. the regression task at hand. Such an embedding involves *K*

**Fig. 7.9** (lhs) One-hot encoding with *q*<sup>0</sup> = 40, and (rhs) embedding layers for VehBrand and Region with embedding dimension *b* = 2 and *q*<sup>0</sup> = 11; the remaining network architecture is identical with *(q*1*, q*2*, q*3*)* = *(*20*,* 15*,* 10*)* for depth *d* = 3

vectors $e^{(k)} \in \mathbb{R}^b$ of dimension *b*; thus, it involves *Kb* parameters, called *embedding weights*.

In network modeling, these embedding weights $e^{(1)}, \ldots, e^{(K)}$ can also be learned during gradient descent training. Basically, it just means that for the categorical variables we add an additional embedding layer before the first FN layer $z^{(1)}$, i.e., we increase the depth of the network by 1 for the categorical feature components (by a layer that is not fully connected). This is illustrated in Fig. 7.9 (rhs) for the French MTPL insurance example of Sect. 7.3.2. The graph on the left-hand side shows the network if we apply one-hot encoding to the categorical variables VehBrand and Region; this results in a network parameter of dimension *r* = 1 306. The graph on the right-hand side first embeds VehBrand and Region into two 2-dimensional spaces, illustrated by the orange and magenta circles. These embeddings are concatenated with the remaining feature components, which provides a new input dimension *q*<sup>0</sup> = 7 + 2 + 2 = 11 in this example. This results in a network parameter of dimension *r* = 726 + 22 + 44 = 792, where 22 + 44 = 66 stands for the 2-dimensional embedding weights of the 11 VehBrands and the 22 French Regions, see Listing 7.5.
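The parameter count *r* = 792 of the embedding-layer network can be verified in the same way as before (a small Python check):

```python
# Recompute the parameter dimension r = 792 of the embedding-layer network:
# two b = 2 dimensional embeddings (11 and 22 levels) plus the FN layers on
# the concatenated q0 = 7 + 2 + 2 = 11 dimensional input.
b = 2                                   # embedding dimension
embedding_weights = 11 * b + 22 * b     # VehBrand and Region: 22 + 44 = 66
widths = [7 + 2 * b, 20, 15, 10, 1]     # q0 = 11, FN layers (20, 15, 10), output
fn_weights = sum((q_in + 1) * q_out for q_in, q_out in zip(widths[:-1], widths[1:]))
r = fn_weights + embedding_weights
```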

*Example 7.10 (Embedding Layers for Categorical Features)* We revisit the example of Sect. 7.3.2, but we replace one-hot encoding of the categorical variables by embedding layers of dimension *b* = 2. The corresponding R code is given in Listing 7.4 and the resulting model is illustrated in Listing 7.5 and Fig. 7.9 (rhs).

Apart from replacing one-hot encoding by embedding layers, we use exactly the same FN network architecture as in Sect. 7.3.2 and we apply the same fitting strategy in terms of batch sizes, optimizer and early stopping strategy. The results are presented in Table 7.4.

**Listing 7.4** FN network of depth *d* = 3 using embedding layers

```
1 Design = layer_input(shape = c(7), dtype = 'float32', name = 'Design')
2 VehBrand = layer_input(shape = c(1), dtype = 'int32', name = 'VehBrand')
3 Region = layer_input(shape = c(1), dtype = 'int32', name = 'Region')
4 Vol = layer_input(shape = c(1), dtype = 'float32', name = 'Vol')
5 #
6 BrandEmb = VehBrand %>%
7 layer_embedding(input_dim=11,output_dim=2,input_length=1,name='BrandEmb') %>%
8 layer_flatten(name='Brand_flat')
9 RegionEmb = Region %>%
10 layer_embedding(input_dim=22,output_dim=2,input_length=1,name='RegionEmb') %>%
11 layer_flatten(name='Region_flat')
12 #
13 Network = list(Design,BrandEmb,RegionEmb) %>% layer_concatenate(name='concate') %>%
14 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
15 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
16 layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
17 layer_dense(units=1, activation='exponential', name='Network',
18 weights=list(array(0, dim=c(10,1)), array(log(lambda0), dim=c(1))))
19 #
20 Response = list(Network, Vol) %>% layer_multiply(name='Multiply')
21 #
22 model = keras_model(inputs = c(Design, VehBrand, Region, Vol),
23 outputs = c(Response))
```
**Table 7.4** Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10−2) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5 and the FN network models (with one-hot encoding and embedding layers of dimension *b* = 2, respectively)


A first remark is that the model calibration takes longer using embedding layers compared to one-hot encoding. The main reason for this is that having an embedding layer increases the depth of the network by one layer, as can be seen from Fig. 7.9. Therefore, the back-propagation takes more time, and the convergence is slower requiring more gradient descent steps. We have less over-fitting as can be seen from Fig. 7.10. The final fitted model has a slightly better out-of-sample performance compared to the one-hot encoding one. However, this slight improvement in the performance should not be overstated because, as explained in Remarks 7.9, there are a couple of elements of randomness involved in SGD fitting, and choosing a different seed may change the results. We remark that the balance property is not fulfilled because the average frequency of the fitted model does not meet the empirical frequency, see the last column of Table 7.4; we come back to this in Sect. 7.4.2, below.


**Listing 7.5** Summary of FN network of Fig. 7.9 (rhs) using embedding layers of dimension *b* = 2

**Fig. 7.10** Training loss $\mathfrak{D}(\mathcal{U}, \vartheta^{(t)})$ vs. validation loss $\mathfrak{D}(\mathcal{V}, \vartheta^{(t)})$ over the iterations *t* ≥ 0 of the SGD algorithm in the deep FN network with embedding layers for the categorical variables

**Fig. 7.11** Embedding weights $e\_{\texttt{VehBrand}} \in \mathbb{R}^2$ and $e\_{\texttt{Region}} \in \mathbb{R}^2$ of the categorical variables VehBrand and Region for embedding dimension *b* = 2

A major advantage of using embedding layers for the categorical variables is that we receive a continuous representation of the nominal variables, where proximity can be interpreted as similarity for the regression task at hand. This is nicely illustrated in Fig. 7.11, which shows the resulting 2-dimensional embeddings $e\_{\texttt{VehBrand}} \in \mathbb{R}^2$ and $e\_{\texttt{Region}} \in \mathbb{R}^2$ of the categorical variables VehBrand and Region. The Region embedding shows surprising similarities with the map of France: for instance, the Paris region R11 is adjacent to R23, R22, R21, R26, R24 (which is also the case on the French map), and the island of Corsica R94 and the South of France R93, R91 and R73 are well separated from the other regions. Similar observations can be made for the embedding of VehBrand: Japanese cars B12 are far apart from the other cars, and cars B1, B2, B3 and B6 (Renault, Nissan, Citroen, Volkswagen, Audi, Skoda, Seat and Fiat) cluster, etc.

#### **Drop-Out Layers and Regularization**

Above, over-fitting to the learning data has been taken care of by early stopping. In view of Sect. 6.2 one could also use regularization. This can easily be obtained by replacing (7.14), for instance, by the following *Lp*-regularized counterpart

$$
\vartheta \mapsto \frac{2}{n} \sum\_{i=1}^{n} \frac{v\_{i}}{\varphi} \Big( Y\_{i} h\left( Y\_{i} \right) - \kappa\left( h\left( Y\_{i} \right) \right) - Y\_{i} h\left( \mu\_{\vartheta}\left( \boldsymbol{x}\_{i} \right) \right) + \kappa\left( h\left( \mu\_{\vartheta}\left( \boldsymbol{x}\_{i} \right) \right) \right) \Big) + \lambda \left\| \vartheta^{-} \right\|\_{p}^{p},
$$

for some *p* ≥ 1 and regularization parameter *λ >* 0, and where the reduced network parameter $\vartheta^{-} \in \mathbb{R}^{r-1}$ excludes the intercept parameter *β*<sup>0</sup> of the output layer; we also refer to (6.4) in the context of GLMs. For grouped penalty terms we refer to (6.21). The difficulty with this approach is the tuning of the regularization parameter(s) *λ*: run time is one issue, suitable grouping is another issue, and non-uniqueness of the optimal network a further one that can substantially distort the selection of reasonable regularization parameters.

A more popular method to prevent individual neurons in a FN layer from over-fitting to a certain task is the use of so-called *drop-out layers*. A drop-out layer is an additional layer between FN layers that randomly removes neurons from the network during gradient descent training, i.e., in each gradient descent step, any of the earmarked neurons is offset independently of the others with a fixed probability *δ* ∈ *(*0*,* 1*)*. This random removal implies that the composite of the remaining neurons needs to be sufficiently well balanced to take over the role of the dropped-out neurons. Therefore, a single neuron cannot be over-trained to a certain task because it needs to be able to play several different roles. Drop-out has been introduced by Srivastava et al. [345] and Wager et al. [373].

**Listing 7.6** FN network of depth *d* = 3 using a drop-out layer, ridge regularization and a normalization layer

```
1 Network = list(Design,BrandEmb,RegionEmb) %>%
2 layer_concatenate(name='concate') %>%
3 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
4 layer_dropout (rate = 0.01) %>%
5 layer_dense(units=15, kernel_regularizer=regularizer_l2(0.0001),
6 activation='tanh', name='FNLayer2') %>%
7 layer_batch_normalization() %>%
8 layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
9 layer_dense(units=1, activation='exponential', name='Network',
10 weights=list(array(0, dim=c(10,1)), array(log(lambda0), dim=c(1))))
```
Listing 7.6 gives an example where we add a drop-out layer with drop-out probability *δ* = 0*.*01 after the first FN layer, and in the second FN layer we apply ridge regularization to the weights $(w\_{1,1}^{(2)}, \ldots, w\_{q\_1,q\_2}^{(2)})$, i.e., excluding the intercepts $w\_{0,j}^{(2)}$, $1 \le j \le q\_2$. Both the drop-out layer and the regularization are only used during the gradient descent fitting; these network features are disabled during prediction.

Drop-out is closely related to ridge regularization as the following linear Gaussian regression example shows; this consideration is taken from Section 18.6 of Efron–Hastie [117]. Assume we have a linear regression problem with square loss function

$$\mathfrak{D}(\boldsymbol{Y},\boldsymbol{\beta}) = \frac{1}{2} \sum\_{i=1}^{n} \left( Y\_i - \langle \boldsymbol{\beta}, \boldsymbol{x}\_i \rangle \right)^2.$$

We assume in this Gaussian case that the observations and the features are standardized, see Sect. 6.2.4. This means that $\sum\_{i=1}^{n} Y\_i = 0$, $\sum\_{i=1}^{n} x\_{i,j} = 0$ and $n^{-1} \sum\_{i=1}^{n} x\_{i,j}^2 = 1$, for all $1 \le j \le q$. This standardization implies that we can omit the intercept parameter *β*<sup>0</sup> because its MLE is equal to 0.

We introduce i.i.d. drop-out random variables $I\_{i,j}$ for $1 \le i \le n$ and $1 \le j \le q$, with $(1-\delta) I\_{i,j}$ being Bernoulli distributed with success probability $1-\delta \in (0,1)$. This scaling implies $\mathbb{E}[I\_{i,j}] = 1$. Using these Bernoulli random variables we modify the above square loss function to

$$\mathfrak{D}\_I(\boldsymbol{Y},\boldsymbol{\beta}) = \frac{1}{2} \sum\_{i=1}^{n} \left( Y\_i - \sum\_{j=1}^{q} \beta\_j I\_{i,j} x\_{i,j} \right)^2,$$

i.e., every individual component $x\_{i,j}$ can drop out independently of the others. The Gaussian MLE requires setting the gradient of $\mathfrak{D}\_I(\boldsymbol{Y}, \boldsymbol{\beta})$ w.r.t. $\boldsymbol{\beta} \in \mathbb{R}^q$ equal to zero. The averaged score equation is given by (we average over the drop-out random variables $I\_{i,j}$)

$$
\begin{aligned}
\mathbb{E}\_{\delta}\left[\left.\nabla\_{\boldsymbol{\beta}}\mathfrak{D}\_{I}(\boldsymbol{Y},\boldsymbol{\beta})\right|\boldsymbol{Y}\right] &= -\mathfrak{X}^{\top}\boldsymbol{Y} + \mathfrak{X}^{\top}\mathfrak{X}\boldsymbol{\beta} + \frac{\delta}{1-\delta}\operatorname{diag}\left(\sum\_{i=1}^{n} x\_{i,1}^{2}, \ldots, \sum\_{i=1}^{n} x\_{i,q}^{2}\right)\boldsymbol{\beta} \\
&= -\mathfrak{X}^{\top}\boldsymbol{Y} + \mathfrak{X}^{\top}\mathfrak{X}\boldsymbol{\beta} + \frac{\delta n}{1-\delta}\,\boldsymbol{\beta} \stackrel{!}{=} 0,
\end{aligned}
$$

where we have used the normalization of the columns of the design matrix $\mathfrak{X} \in \mathbb{R}^{n \times q}$ (we drop the intercept column). This is ridge regression in the linear Gaussian case with regularization parameter *λ* = *δ/(*2*(*1 − *δ)) >* 0 for *δ* ∈ *(*0*,* 1*)*, see (6.9).
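The key step of this derivation, namely that averaging over the drop-out variables only inflates the diagonal of the Gram matrix $\mathfrak{X}^{\top}\mathfrak{X}$ by the factor $\mathbb{E}[I\_{i,j}^2] = 1/(1-\delta)$, can be checked numerically on toy data (a Python/NumPy sketch):

```python
import numpy as np

# Check of the drop-out/ridge correspondence: for standardized columns
# (sum_i x_{i,j}^2 = n), inflating the diagonal of X'X by E[I^2] = 1/(1-delta)
# turns the ordinary score into the ridge-regularized score with penalty
# term delta*n/(1-delta) * beta.
rng = np.random.default_rng(1)
n, q, delta = 50, 4, 0.1

X = rng.normal(size=(n, q))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # centered, sum of squares n per column
Y = rng.normal(size=n)
Y -= Y.mean()
beta = rng.normal(size=q)

gram = X.T @ X
gram_dropout = gram.copy()
np.fill_diagonal(gram_dropout, np.diag(gram) / (1 - delta))  # E[I^2] = 1/(1-delta)

expected_score = -X.T @ Y + gram_dropout @ beta              # E_delta[grad D_I | Y]
ridge_score = -X.T @ Y + gram @ beta + delta * n / (1 - delta) * beta
```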

#### **Normalization Layers**

In (7.29) and (7.30) we have discussed that the continuous feature components should be pre-processed so that all components live on the same scale, otherwise the gradient descent fitting may not be efficient. A similar phenomenon may occur with the learned representations $z^{(m:1)}(\boldsymbol{x}\_i)$ in the FN layers $1 \le m \le d$. In particular, this is the case if we choose an unbounded activation function *φ*. For this reason, it can be advantageous to rescale the components $z\_j^{(m:1)}(\boldsymbol{x}\_i)$, $1 \le j \le q\_m$, in a given FN layer back to the same scale. To achieve this, a normalization step (7.30) is applied to every neuron $z\_j^{(m:1)}(\boldsymbol{x}\_i)$ over the cases *i* in the considered (mini-)batch. This involves two more parameters (for the empirical mean and the empirical standard deviation) in each neuron of the corresponding FN layer. Note, however, that all these operations are of a linear nature. Therefore, they do not affect the predictive model (i.e., these operations cancel in the scalar products in (7.6)), but they may improve the performance of the gradient descent algorithm.

The code in Listing 7.6 uses a normalization layer on line 7. In our applications it has not been necessary to use these normalization layers, as they have not led to better run times in the SGD algorithms; note that our networks are not very deep and they use the symmetric and bounded hyperbolic tangent activation function.
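The normalization step itself, applied per neuron over a mini-batch, can be sketched as follows (plain NumPy, without the learnable scale and shift parameters of the keras layer):

```python
import numpy as np

# Normalization step (7.30) applied per neuron over a mini-batch: each of the
# q_m neurons is rescaled to empirical mean 0 and standard deviation 1 over
# the cases in the batch (the core of a batch-normalization layer).
rng = np.random.default_rng(2)
z = rng.normal(loc=3.0, scale=5.0, size=(32, 10))  # activations: batch 32, q_m = 10

mean = z.mean(axis=0)        # per-neuron empirical mean over the batch
sd = z.std(axis=0)           # per-neuron empirical standard deviation
z_norm = (z - mean) / sd     # each neuron rescaled to the same scale
```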

## *7.4.2 The Balance Property in Neural Networks*

We have seen in Table 7.4 that our FN network outperforms the GLM for claim frequency prediction in terms of a lower out-of-sample loss. We interpret this as follows. Feature engineering has not been done in the most optimal way for the GLM because the FN network finds modeling structure that is not present in the selected GLM. As a consequence, the FN network provides a better generalization to unseen data, i.e., we can better predict new data on a granular level with the FN network. However, having a more precise model on an individual policy level does not necessarily imply that the model also performs better on a global portfolio level. In our example we see that we may have smaller errors on an individual policy level, but these smaller errors do not aggregate to a more precise model in the average portfolio frequency. In our case, we have a misspecification of the average portfolio frequency, see the last column of Table 7.4. This is a major deficiency in insurance pricing because it may result in a misspecification of the overall price level, and this requires a correction. We call this correction *bias regularization*.

#### **Simple Bias Regularization**

The straightforward correction is to adjust the intercept parameter *<sup>β</sup>*<sup>0</sup> <sup>∈</sup> <sup>R</sup> accordingly. That is, compare the empirical mean

$$
\bar{\mu} = \frac{\sum\_{i=1}^{n} v\_{i} Y\_{i}}{\sum\_{i=1}^{n} v\_{i}},
$$

to the model average of the fitted FN network

$$
\widehat{\mu} = \frac{\sum\_{i=1}^{n} v\_{i}\, \mu\_{\widehat{\vartheta}}(\boldsymbol{x}\_{i})}{\sum\_{i=1}^{n} v\_{i}},
$$

where $\widehat{\vartheta} = (\widehat{\boldsymbol{w}}\_1^{(1)}, \ldots, \widehat{\boldsymbol{w}}\_{q\_d}^{(d)}, \widehat{\boldsymbol{\beta}})^\top \in \mathbb{R}^r$ is the learned network parameter from the (early stopped) SGD algorithm. The output of this fitted model reads as

$$\boldsymbol{x}\_i \mapsto \mu\_{\widehat{\vartheta}}(\boldsymbol{x}\_i) = g^{-1} \left\langle \widehat{\boldsymbol{\beta}}, \widehat{\boldsymbol{z}}^{(d:1)}(\boldsymbol{x}\_i) \right\rangle = g^{-1} \left( \widehat{\beta}\_0 + \sum\_{j=1}^{q\_d} \widehat{\beta}\_j\, \widehat{z}\_j^{(d:1)}(\boldsymbol{x}\_i) \right),$$

where the hat in $\widehat{\boldsymbol{z}}^{(d:1)}$ indicates that we use the estimated weights $\widehat{\boldsymbol{w}}\_l^{(m)}$, $1 \le l \le q\_m$, $1 \le m \le d$, in the FN layers. The balance property can be rectified by replacing $\widehat{\beta}\_0$ by the solution $\widehat{\widehat{\beta}}\_0$ of the following identity

$$\sum\_{i=1}^{n} v\_i Y\_i \stackrel{!}{=} \sum\_{i=1}^{n} v\_i g^{-1} \left( \widehat{\beta}\_0 + \sum\_{j=1}^{q\_d} \widehat{\beta}\_j \widehat{z}\_j^{(d:1)}(\mathbf{x}\_i) \right).$$

Since $g^{-1}$ is continuous and strictly monotone, there is a unique solution to this requirement, provided that the range of $g^{-1}$ covers the support of the *Yi*'s. If we work with the log-link *g(*·*)* = log*(*·*)*, this can easily be solved and we obtain

$$
\widehat{\hat{\beta}}\_0 = \hat{\beta}\_0 + \log\left(\frac{\bar{\mu}}{\hat{\mu}}\right).
$$
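Under the log-link this intercept correction is a multiplicative rescaling of all fitted values, and the restored balance can be checked directly (a Python sketch with illustrative toy numbers, not the MTPL data):

```python
import math

# Simple bias regularization under the log-link: shifting the intercept by
# log(mu_bar / mu_hat) multiplies every fitted value by mu_bar / mu_hat,
# which restores the balance property. Toy exposures and frequencies.
v = [0.5, 1.0, 0.8, 0.3]            # exposures v_i
Y = [0.10, 0.05, 0.08, 0.12]        # observed frequencies Y_i
mu_fit = [0.07, 0.06, 0.09, 0.10]   # fitted frequencies mu_theta(x_i)

mu_bar = sum(vi * yi for vi, yi in zip(v, Y)) / sum(v)        # empirical mean
mu_hat = sum(vi * mi for vi, mi in zip(v, mu_fit)) / sum(v)   # model average

shift = math.log(mu_bar / mu_hat)                             # intercept correction
mu_corrected = [mi * math.exp(shift) for mi in mu_fit]        # rescaled fitted values
balance = sum(vi * mi for vi, mi in zip(v, mu_corrected)) / sum(v)
```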

#### **Sophisticated Bias Regularization Under the Canonical Link Choice**

If we work with the canonical link $g = h = (\kappa')^{-1}$, we can do better because the MLE of such a GLM automatically provides the balance property, see Corollary 5.7. Choose the SGD learned network parameter $\widehat{\vartheta} = (\widehat{\boldsymbol{w}}\_1^{(1)}, \ldots, \widehat{\boldsymbol{w}}\_{q\_d}^{(d)}, \widehat{\boldsymbol{\beta}})^\top \in \mathbb{R}^r$. Denote by $\widehat{\boldsymbol{z}}^{(d:1)}$ the fitted network architecture that is based on the estimated weights $\widehat{\boldsymbol{w}}\_1^{(1)}, \ldots, \widehat{\boldsymbol{w}}\_{q\_d}^{(d)}$. This allows us to study the learned representations of the raw features $\boldsymbol{x}\_1, \ldots, \boldsymbol{x}\_n$ in the last FN layer. We denote these learned representations by

$$
\widehat{\boldsymbol{z}}_1 = \widehat{\boldsymbol{z}}^{(d:1)}(\boldsymbol{x}_1), \; \ldots, \; \widehat{\boldsymbol{z}}_n = \widehat{\boldsymbol{z}}^{(d:1)}(\boldsymbol{x}_n) \in \{1\} \times \mathbb{R}^{q_d}. \tag{7.32}
$$

These learned representations can be used as new features to explain the response *Y*. We define the feature engineered design matrix by

$$\widehat{\mathfrak{X}} = \left(\widehat{\boldsymbol{z}}_1, \ldots, \widehat{\boldsymbol{z}}_n\right)^\top \in \mathbb{R}^{n \times (q_d+1)}.$$

Based on this new design matrix $\widehat{\mathfrak{X}}$ we can run a classical GLM, receiving a unique MLE $\widehat{\boldsymbol{\beta}}^{\rm MLE} \in \mathbb{R}^{q_d+1}$, provided that this design matrix has full rank $q_d + 1 \le n$, see Proposition 5.1. Since we work with the canonical link, this re-calibrated FN network will automatically satisfy the balance property, and the resulting regression function reads as

$$\boldsymbol{x} \mapsto \widehat{\mu}(\boldsymbol{x}) = h^{-1}\left\langle \widehat{\boldsymbol{\beta}}^{\rm MLE}, \widehat{\boldsymbol{z}}^{(d:1)}(\boldsymbol{x})\right\rangle. \tag{7.33}$$

This is the proposal of Wüthrich [390]. We give some remarks.

*Remarks 7.11*


*Example 7.12 (Balance Property in Networks)* We apply this additional MLE step to the two FN networks of Table 7.4. Note that in these two examples we consider a Poisson model using the canonical link for *g*, thus, the resulting adjusted network (7.33) will automatically satisfy the balance property, see Corollary 5.7.

**Listing 7.7** Balance property adjustment (7.33)

```
1 glm.formula <- function(nn){
2   string <- "yy ~ X1"
3   if (nn>1){for (ll in 2:nn){ string <- paste(string, "+X",ll, sep="")}}
4   string
5 }
6 #
7 zz <- keras_model(inputs=model$input,
8                   outputs=get_layer(model, 'FNLayer3')$output)
9 xx.learn <- data.frame(zz %>% predict(list(Xlearn, Vlearn)))
10 q3 <- ncol(xx.learn)
11 xx.learn$yy <- Ylearn
12 xx.learn$Exposure <- learn$Exposure
13 #
14 glm1 <- glm(as.formula(glm.formula(q3)),
15             data=xx.learn, offset=log(Exposure), family=poisson())
16
17 #
18 w1 <- get_weights(model)
19 w1[[7]] <- array(glm1$coefficients[2:(q3+1)], dim=c(q3,1))
20 w1[[8]] <- array(glm1$coefficients[1], dim=c(1))
21 set_weights(model, w1)
```
In Listing 7.7 we illustrate the necessary code that has to be added to Listings 7.1–7.3. On lines 7–8 of Listing 7.7 we retrieve the learned representations (7.32), which are used as the new features in the Poisson GLM on lines 14–15. The resulting MLE $\widehat{\boldsymbol{\beta}}^{\rm MLE} \in \mathbb{R}^{q_d+1}$ is imputed to the network parameter $\widehat{\vartheta}$ on lines 18–21. Table 7.5 shows the performance of the resulting bias regularized FN networks.
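The mechanism behind this recalibration does not depend on keras: one extracts the last-layer activations and refits a Poisson GLM with canonical log-link on them, and the MLE then enforces the balance property. A minimal, self-contained Python sketch (synthetic activations standing in for $\widehat{\boldsymbol{z}}^{(d:1)}(\boldsymbol{x}_i)$, Fisher scoring in place of R's `glm`; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 500, 3
Z = np.hstack([np.ones((n, 1)), rng.normal(size=(n, q))])  # design (1, z_1, ..., z_q)
N = rng.poisson(np.exp(-2.0 + 0.3 * Z[:, 1]))              # Poisson claim counts

# Fisher scoring (IRLS) for the Poisson GLM under the canonical log-link
beta = np.zeros(q + 1)
beta[0] = np.log(N.mean())                                 # start in the null model
for _ in range(50):
    mu = np.exp(Z @ beta)
    score = Z.T @ (N - mu)                                 # score equations
    info = Z.T @ (Z * mu[:, None])                         # Fisher information
    beta += np.linalg.solve(info, score)

mu_hat = np.exp(Z @ beta)
# balance property at the MLE under the canonical link (Corollary 5.7)
assert np.isclose(N.sum(), mu_hat.sum())
```

The balance property is precisely the intercept component of the score equations $Z^\top(N - \mu) = 0$ at the MLE.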

Firstly, we observe from the last column of Table 7.5 that, indeed, the bias regularization step (7.33) provides the balance property. In general, in-sample losses (have to) decrease because $\widehat{\boldsymbol{\beta}}^{\rm MLE}$ is (in-sample) more optimal than the early stopped SGD solution $\widehat{\boldsymbol{\beta}}$. Out-of-sample this leads to a small improvement in the one-hot encoded variant and a small worsening in the embedding variant, i.e., the latter slightly over-fits in this additional MLE step. However, these differences are comparably small, so that we do not worry further about the over-fitting here. This closes this example.

**Table 7.5** Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in $10^{-2}$) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5 and the FN network models (with one-hot encoding and embedding layers of dimension $b = 2$, respectively), and their bias regularized counterparts

#### **Auto-Calibration for Bias Regularization**

We present another approach to correcting for a potential failure of the balance property. This method does not depend on a particular type of regression model, i.e., it can be applied to any regression model. This proposal goes back to Denuit et al. [97], and it is based on the notion of *auto-calibration* introduced by Patton [297] and Krüger–Ziegel [227]. We first describe auto-calibration and its implications.

**Definition 7.13** The random variable $Z$ is an auto-calibrated forecast of the random variable $Y$ if $\mathbb{E}[Y|Z] = Z$, a.s.

If the response *Y* is described by the features *X* = *x*, we consider the conditional mean of *Y* , given *X*,

$$
\mu(X) = \mathbb{E}\left[Y|X\right].
$$

This conditional mean $\mu(X)$ is an auto-calibrated forecast for the response $Y$. Use the tower property and note that $\sigma(\mu(X)) \subset \sigma(X)$ to receive, a.s.,

$$\mathbb{E}\left[Y|\,\mu(X)\right] = \mathbb{E}\left[\mathbb{E}\left[Y|X\right]|\,\mu(X)\right] = \mathbb{E}\left[\mu(X)|\,\mu(X)\right] = \mu(X).$$
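This tower-property argument can be checked exactly on a finite state space. The sketch below (Python, hypothetical numbers) regroups a non-injective conditional mean $\mu(X)$ by its values and confirms $\mathbb{E}[Y|\mu(X)] = \mu(X)$, while a biased predictor fails this test:

```python
import numpy as np

# finite feature space: P[X = x] and mu(x) = E[Y | X = x]
p_X  = np.array([0.4, 0.3, 0.2, 0.1])
mu_X = np.array([0.1, 0.3, 0.3, 0.8])    # non-injective: two x's share mu(x) = 0.3

def cond_mean_given(pred):
    """E[Y | pred(X) = z], computed exactly by regrouping over {x : pred(x) = z}."""
    out = np.empty_like(pred)
    for z in np.unique(pred):
        idx = pred == z
        out[idx] = np.sum(p_X[idx] * mu_X[idx]) / np.sum(p_X[idx])
    return out

# the conditional mean is auto-calibrated: E[Y | mu(X)] = mu(X)
assert np.allclose(cond_mean_given(mu_X), mu_X)

# a multiplicatively biased predictor is not auto-calibrated
assert not np.allclose(cond_mean_given(1.1 * mu_X), 1.1 * mu_X)
```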

For the further understanding of auto-calibration and forecast dominance, we introduce the concept of *convex order*; forecast dominance has been introduced in Definition 4.20.

**Definition 7.14 (Convex Order)** A random variable $Z_1$ is bigger in convex order than a random variable $Z_2$, write $Z_1 \geq_{\rm cx} Z_2$, if $\mathbb{E}[\Psi(Z_1)] \ge \mathbb{E}[\Psi(Z_2)]$ for all convex functions $\Psi$ for which the expectations exist.

By Strassen's theorem [346], $Z_1 \geq_{\rm cx} Z_2$ if and only if there exist random variables $Z_1'$ and $Z_2'$ with $Z_1 \stackrel{(d)}{=} Z_1'$, $Z_2 \stackrel{(d)}{=} Z_2'$ and $\mathbb{E}[Z_1' | Z_2'] = Z_2'$, a.s. In particular, the convex order $Z_1 \geq_{\rm cx} Z_2$ implies that $\mathrm{Var}(Z_1) \ge \mathrm{Var}(Z_2)$ and $\mathbb{E}[Z_1] = \mathbb{E}[Z_2]$. The latter follows from Strassen's theorem and the tower property, and the former follows from the latter and the convex order by using the explicit choice $\Psi(x) = x^2$. Thus, the random variable $Z_1$ is more volatile than $Z_2$, both having the same mean. The following theorem shows that this additional volatility is a favorable property in terms of forecast dominance under auto-calibration.

**Theorem 7.15 (Krüger–Ziegel [227, Theorem 3.1], Without Proof)** *Assume that $\widehat{\mu}_1$ and $\widehat{\mu}_2$ are auto-calibrated forecasts for the random variable $Y$. Predictor $\widehat{\mu}_1$ forecast dominates $\widehat{\mu}_2$ if and only if $\widehat{\mu}_1 \geq_{\rm cx} \widehat{\mu}_2$.*

Recall that forecast dominance of $\widehat{\mu}_1$ over $\widehat{\mu}_2$ was defined as follows, see Definition 4.20,

$$\mathbb{E}\left[D\_{\psi}\left(Y,\widehat{\mu}\_{1}\right)\right] \le \mathbb{E}\left[D\_{\psi}\left(Y,\widehat{\mu}\_{2}\right)\right],$$

for all Bregman divergences $D_\psi$. Strassen's theorem tells us that $\widehat{\mu}_1$ is more volatile than $\widehat{\mu}_2$ (both being auto-calibrated and unbiased for $\mathbb{E}[Y]$), and this additional volatility implies that the former auto-calibrated predictor can better follow $Y$. This provides the superior forecast dominance of $\widehat{\mu}_1$ over $\widehat{\mu}_2$. This relation is most easily understood by the following example. Consider $(Y, X)$ as above. Assume that the feature $\widetilde{X}$ is a sub-variable of the feature $X$, obtained by dropping some of the components of $X$. Naturally, we have $\sigma(\widetilde{X}) \subset \sigma(X)$, and both sets of information provide auto-calibrated forecasts

$$
\mu(X) = \mathbb{E}\left[Y|X\right] \qquad \text{and} \qquad \mu(\widetilde{X}) = \mathbb{E}\left[Y|\widetilde{X}\right].
$$

The tower property and Jensen's inequality give, for any convex function $\Psi$ (subject to existence),

$$\begin{aligned} \mathbb{E}\left[\Psi(\mu(X))\right] &= \mathbb{E}\left[\Psi\left(\mathbb{E}\left[Y|X\right]\right)\right] = \mathbb{E}\left[\mathbb{E}\left[\Psi\left(\mathbb{E}\left[Y|X\right]\right)\left|\widetilde{X}\right]\right] \\ &\ge \mathbb{E}\left[\Psi\left(\mathbb{E}\left[\mathbb{E}\left[Y|X\right]|\widetilde{X}\right]\right)\right] = \mathbb{E}\left[\Psi\left(\mathbb{E}\left[Y|\widetilde{X}\right]\right)\right] = \mathbb{E}\left[\Psi\left(\mu(\widetilde{X})\right)\right]. \end{aligned}$$

Thus, we have $\mu(X) \geq_{\rm cx} \mu(\widetilde{X})$, which implies forecast dominance of $\mu(X)$ over $\mu(\widetilde{X})$. This makes perfect sense in view of $\sigma(\widetilde{X}) \subset \sigma(X)$. Basically, this describes the construction of an $\mathbb{F}$-martingale using an integrable random variable $Y$ and a filtration $\mathbb{F}$ on the underlying probability space $(\Omega, \mathcal{A}, \mathbb{P})$. This martingale sequence provides forecast dominance with increasing information sets described by the filtration $\mathbb{F}$.
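A small Monte Carlo experiment (a hypothetical Gaussian setup, not from the text) illustrates both claims: dropping a feature component yields a forecast that is smaller in convex order (same mean, smaller variance) and incurs a larger expected Bregman loss, here under the squared-error divergence:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = X1 + X2 + rng.normal(size=n)         # E[Y | X1, X2] = X1 + X2

mu_full   = X1 + X2                      # mu(X) with full feature X = (X1, X2)
mu_coarse = X1                           # mu(X~) after dropping X2: E[Y | X1] = X1

# both are unbiased for E[Y] = 0, but mu(X) is more volatile (convex order)
assert abs(mu_full.mean() - mu_coarse.mean()) < 0.02
assert mu_full.var() > mu_coarse.var()

# forecast dominance under the squared-error Bregman divergence
mse_full   = np.mean((Y - mu_full) ** 2)
mse_coarse = np.mean((Y - mu_coarse) ** 2)
assert mse_full < mse_coarse             # roughly 1 vs. 2
```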

We now turn our attention to the balance property and the unbiasedness of predictors; this follows Denuit et al. [97]. Assume we have any predictor $\widehat{\mu}(\boldsymbol{x})$ of $Y$; for instance, this can be any FN network predictor $\mu_{\widehat{\vartheta}}(\boldsymbol{x})$ coming from an early stopped SGD algorithm. We define its *balance-corrected* version by

$$
\widehat{\mu}\_{\rm BC}(\mathbf{x}) = \mathbb{E}\left[Y \, \middle| \, \widehat{\mu}(\mathbf{x})\right]. \tag{7.34}
$$

**Proposition 7.16 (Wüthrich [391, Proposition 4.6], Without Proof)** *The balance-corrected predictor <sup>μ</sup>BC(X) is an auto-calibrated forecast for <sup>Y</sup> .*

*Remarks 7.17 (Expected Deviance Generalization Loss)* We return to the decomposition of the expected deviance GL given in Theorem 4.7, but we now add the features $X = \boldsymbol{x}$. The expected deviance GL of a predictor $\widehat{\mu}(X)$ under the unit deviance $\mathfrak{d}$ then reads as

$$\begin{aligned} \mathbb{E}_{\theta}\left[\mathfrak{d}\left(Y, \widehat{\mu}(X)\right)\right] &= \mathbb{E}_{\theta}\left[\mathfrak{d}\left(Y, \mu\right)\right] \\ &\quad + 2\Big(\mu h(\mu) - \kappa\left(h(\mu)\right) - \mathbb{E}_{\theta}\left[Y h\left(\widehat{\mu}(X)\right)\right] + \mathbb{E}_{\theta}\left[\kappa\left(h\left(\widehat{\mu}(X)\right)\right)\right]\Big), \end{aligned}$$

where $\mu = \mathbb{E}_{\theta}[Y]$ is the unconditional mean of $Y$ (averaging also over the feature distribution of $X$). Note that this formula differs from (4.13) because $Y$ and $h(\widehat{\mu}(X))$ are no longer independent if we include the features $X$. The term $\mathbb{E}_{\theta}[\mathfrak{d}(Y, \mu)]$ is called the *entropy*, which is driven by the stochastic nature of the random variable $Y$. This is the irreducible risk if no feature information is available.

In statistical modeling one considers different decompositions of the expected deviance GL; we refer to Fissler et al. [129]. Namely, introducing the features $X$, we can reduce the expected deviance GL compared to the unconditional mean $\mu$ in terms of forecast dominance. This allows us to decouple as follows for the prediction $\mu(X) = \mathbb{E}_{\theta}[Y|X]$

$$\begin{aligned} \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \widehat{\mu} (X) \right) \right] &= \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \mu \right) \right] - \left( \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \mu \right) \right] - \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \mu (X) \right) \right] \right) \\ &+ \left( \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \widehat{\mu} (X) \right) \right] - \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \mu (X) \right) \right] \right). \end{aligned}$$

This expresses the expected deviance GL of the predictor $\widehat{\mu}(X)$ as the entropy (first term), the *conditional resolution* (second term) and the *conditional calibration* (third term). The conditional resolution describes the information gain in terms of forecast dominance knowing the feature $X$, and the conditional calibration describes how well we estimate $\mu(X)$. The conditional resolution is positive because $\mu(X) \geq_{\rm cx} \mu$ and the unit deviance $\mathfrak{d}(Y, \cdot)$ is a convex function, see Lemma 2.22. The conditional calibration is also positive; this can be seen by considering the deviance GL, conditional on $X$.

We can reformulate this expected deviance GL in terms of the auto-calibration property

$$\begin{aligned} \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \widehat{\mu} (X) \right) \right] &= \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \mu \right) \right] - \left( \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \mu \right) \right] - \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \widehat{\mu}\_{\text{BC}} (X) \right) \right] \right) \\ &+ \left( \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \widehat{\mu} (X) \right) \right] - \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \widehat{\mu}\_{\text{BC}} (X) \right) \right] \right). \end{aligned}$$

The first term is the entropy, the second term is called the *auto-resolution*, and the third term describes the *auto-calibration*. If we have an auto-calibrated forecast $\widehat{\mu}(X)$, then the last term vanishes because $\widehat{\mu}(X)$ is equal to its balance-corrected version $\widehat{\mu}_{\rm BC}(X)$. Again, these two latter terms are positive; for the auto-calibration term this can be seen by considering the deviance GL, conditioned on $\widehat{\mu}(X)$.

To rectify the balance property we directly focus on (7.34), and we *estimate* this conditional expectation. That is, the balance correction can be achieved by an additional regression step directly estimating the balance-corrected version $\widehat{\mu}_{\rm BC}(\boldsymbol{x})$ in (7.34). This additional regression step differs from (7.33) as it does not use the learned representations $\widehat{\boldsymbol{z}}^{(d:1)}(\boldsymbol{x})$ in the last FN layer (7.32), but it uses the learned representations in the output layer. That is, consider the learned features

$$\widehat{\boldsymbol{z}}^{\star}_1 = \left(1, \mu_{\widehat{\vartheta}}(\boldsymbol{x}_1)\right)^\top, \; \ldots, \; \widehat{\boldsymbol{z}}^{\star}_n = \left(1, \mu_{\widehat{\vartheta}}(\boldsymbol{x}_n)\right)^\top \in \{1\} \times \mathbb{R},$$

and perform an additional linear regression step for the response *Y* using the design matrix

$$\widehat{\mathfrak{X}} = \left(\widehat{z}\_1^\star, \dots, \widehat{z}\_n^\star\right)^\top \in \mathbb{R}^{n \times 2}.$$

This additional linear regression step gives us an estimate

$$
\widehat{\boldsymbol{\beta}} = \left(\widehat{\mathfrak{X}}^\top V \widehat{\mathfrak{X}}\right)^{-1} \widehat{\mathfrak{X}}^\top V \boldsymbol{Y} \in \mathbb{R}^2, \tag{7.35}
$$

with diagonal weight matrix $V = \mathrm{diag}(v_i)_{1 \le i \le n}$. The balance property is then restored by estimating the balance-corrected means $\widehat{\mu}_{\rm BC}(\boldsymbol{x}_i)$ by

$$
\widehat{\mu}_{\rm BC}(\boldsymbol{x}_i) = \widehat{\beta}_0 + \widehat{\beta}_1\, \mu_{\widehat{\vartheta}}(\boldsymbol{x}_i), \tag{7.36}
$$

for 1 ≤ *i* ≤ *n*. Note that this can be done for any regression model since we do not rely on the network architecture in this step.
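The two-step Gaussian correction (7.35)–(7.36) is a weighted simple linear regression of the response on the predictor; the balance property then follows because the weighted residuals are orthogonal to the intercept column of $\widehat{\mathfrak{X}}$. A Python sketch on synthetic data (the book works in R; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
v = rng.uniform(0.5, 1.0, n)                      # exposures, weight matrix V
mu_hat = np.exp(rng.normal(-2.4, 0.4, n))         # predictor mu_theta(x_i)
Y = rng.poisson(v * 1.1 * mu_hat) / v             # responses with a 10% level bias

X = np.column_stack([np.ones(n), mu_hat])         # design matrix rows (1, mu_hat(x_i))
V = np.diag(v)
beta = np.linalg.solve(X.T @ V @ X, X.T @ V @ Y)  # weighted least squares (7.35)
mu_bc = X @ beta                                  # balance-corrected means (7.36)

# the (weighted) balance property is restored
assert np.isclose(np.sum(v * Y), np.sum(v * mu_bc))
```

Under this simulated 10% bias, one expects the slope $\widehat{\beta}_1$ to come out roughly 1.1, mirroring the gentle distortion found in Example 7.19.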

#### *Remarks 7.18*

• Balance correction (7.36) may lead to some conflict in range if the dual (mean) parameter space $\mathcal{M}$ is (one-sided) bounded. Moreover, it does not consider the deviance loss of the response $Y$, but rather underlies a Gaussian model by using the weighted square loss function for finding (the Gaussian MLE) $\widehat{\boldsymbol{\beta}} \in \mathbb{R}^2$. Alternatively, we could consider the canonical link $h$ that belongs to the chosen EDF. This then allows us to study the regression problem on the canonical scale by setting for the learned representations

$$\widehat{\boldsymbol{z}}^{\theta}_1 = \left(1, h(\mu_{\widehat{\vartheta}}(\boldsymbol{x}_1))\right)^\top, \; \ldots, \; \widehat{\boldsymbol{z}}^{\theta}_n = \left(1, h(\mu_{\widehat{\vartheta}}(\boldsymbol{x}_n))\right)^\top \in \{1\} \times \boldsymbol{\Theta}. \tag{7.37}$$

The latter motivates the consideration of a GLM under the chosen EDF

$$\boldsymbol{x}_i \mapsto h\left(\widehat{\mu}_{\rm BC}(\boldsymbol{x}_i)\right) = \langle \boldsymbol{\beta}, \widehat{\boldsymbol{z}}^{\theta}_i \rangle = \beta_0 + \beta_1\, h(\mu_{\widehat{\vartheta}}(\boldsymbol{x}_i)), \tag{7.38}$$

for regression parameter $\boldsymbol{\beta} \in \mathbb{R}^2$. The choice of the canonical link and the inclusion of an intercept will provide the balance property when estimating $\boldsymbol{\beta}$ with MLE, see Corollary 5.7. If the mean estimates $\mu_{\widehat{\vartheta}}(\boldsymbol{x}_i)$ involve the canonical link $h$, (7.38) reads as

$$\boldsymbol{x}_i \mapsto h\left(\widehat{\mu}_{\rm BC}(\boldsymbol{x}_i)\right) = \langle \boldsymbol{\beta}, \widehat{\boldsymbol{z}}^{\theta}_i \rangle = \beta_0 + \beta_1 \left\langle \widehat{\boldsymbol{\beta}}, \widehat{\boldsymbol{z}}^{(d:1)}(\boldsymbol{x}_i)\right\rangle,$$

where the latter scalar product is the output activation received from the FN network. From this we see that the estimated balance-corrected calibration on the canonical scale gives a non-optimal (in-sample) estimation step compared to (7.33), if we work with the canonical link $h$.

• Denuit et al. [97] give a proposal to break down the global balance to a local version using a suitable kernel function; this is further discussed in Example 7.19 below.

*Example 7.19 (Auto-calibration in Networks)* We apply this additional auto-calibration step (7.34) to the FN network with embedding layers that does not satisfy the balance property, i.e., having an average frequency of $7.24\% < 7.36\%$, see Tables 7.4 and 7.5. We start by analyzing the auto-calibration property (7.34) of this network predictor $v\mu_{\widehat{\vartheta}}(\boldsymbol{x})$ by studying an empirical version of

$$z \mapsto v\,\widehat{\mu}_{\rm BC}(\boldsymbol{x}) = \mathbb{E}\left[vY \,\middle|\, v\mu_{\widehat{\vartheta}}(\boldsymbol{x}) = z\right]. \tag{7.39}$$

This empirical version is obtained from the R library locfit [254], which allows us to consider a local polynomial regression fit of degree deg=2, and we use a nearest neighbor fraction of alpha=0.05; the code is provided in Listing 7.8. We use the exposure-$v$ scaled version in (7.39) since the balance property should hold on that scale, see Corollary 5.7. The claim counts are given by $N = vY$, and the exposure $v$ is integrated as an offset into the FN network regression function, see line 20 of Listing 7.4.



Figure 7.12 (lhs) shows the empirical auto-calibration of (7.39) using the R code of Listing 7.8. If the auto-calibration held exactly, the black dots would lie on the red diagonal line. We observe a very good match, which indicates that the auto-calibration property holds quite accurately for our network predictor $(v, \boldsymbol{x}) \mapsto v\mu_{\widehat{\vartheta}}(\boldsymbol{x})$. For very small expectations $\mathbb{E}_{\theta(\boldsymbol{x})}[N]$ we slightly underestimate, and for bigger expectations we slightly overestimate. The blue line shows the empirical density of the predictors $v_i \mu_{\widehat{\vartheta}}(\boldsymbol{x}_i)$, $1 \le i \le n$, highlighting heavy-tailedness and that the underestimation in the right tail will not substantially contribute to the balance property, as these are only very few insurance policies.

We explore the Gaussian balance correction (7.35) considering a linear regression model with the weighted square loss function. We receive the estimate $\widehat{\boldsymbol{\beta}} = (9 \cdot 10^{-4}, 1.005)^\top$; thus, $\mu_{\widehat{\vartheta}}(\boldsymbol{x})$ only gets very gently distorted, see (7.36). The results of this balance-corrected version $\widehat{\mu}_{\rm BC}(\boldsymbol{x})$ are given on the line 'embed FN Gauss balance-corrected' in Table 7.6. We observe that this approach is rather competitive, leading to a slightly better model (out-of-sample). Figure 7.12 (rhs) shows the resulting (empirical) auto-calibration plot, which is still not fully in line with Proposition 7.16; this empirical plot may be distorted by the exposures, by the fact that it is an empirical plot fitted with locfit, and by the fact that a linear Gaussian correction estimate may not be fully suitable.

**Fig. 7.12** (lhs) Empirical auto-calibration (7.39), the blue line shows the empirical density of the predictors $v_i \mu_{\widehat{\vartheta}}(\boldsymbol{x}_i)$, $1 \le i \le n$; (rhs) balance-corrected version using the weighted Gaussian correction (7.35)

**Table 7.6** Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in $10^{-2}$) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network model (with embedding layers of dimension $b = 2$), and their bias regularized and balance-corrected counterparts; the local correction uses a GAM with 2.6 degrees of freedom in the cubic spline part

Denuit et al. [97] propose a local balance correction that is very much in the spirit of the local polynomial regression fit with locfit. However, when using locfit we did not pay any attention to the balance property. Therefore, we proceed slightly differently here. In formula (7.37) we give the network predictors on the canonical scale. This equips us with the data $(Y_i, v_i, \widehat{\boldsymbol{z}}^{\theta}_i)_{1 \le i \le n}$. To perform a local balance correction we fit a generalized additive model (GAM) to this data, using the canonical link, the Poisson deviance loss function, the observations $Y_i$, the exposures $v_i$ and the feature information $\widehat{\boldsymbol{z}}^{\theta}_i$; for GAMs we refer to Hastie–Tibshirani [181, 182], Wood [384] and Chapter 3 in Wüthrich–Buser [392]; in particular, we proceed as in Example 3.4 of the latter reference.

The GAM regression fit on the canonical scale is illustrated in Fig. 7.13 (lhs). We essentially receive a straight line, which says that the auto-calibration property is already well satisfied by the FN network predictor $\mu_{\widehat{\vartheta}}$. In fact, it is not completely a straight line, but GCV provides an optimal model with 2.6 effective degrees of freedom in the natural cubic spline part. This local (GAM) balance correction leads to another small model improvement (out-of-sample), see the last line of Table 7.6.

**Conclusion** The balance property adjustment and the bias regularization are crucial in ensuring that the predictive model is on the right (price) level. We have presented three sophisticated methods of balance property adjustment: the additional GLM step under the canonical link choice (7.33), the model-free global Gaussian correction (7.35)–(7.36), and the local balance correction using a GAM under the canonical link choice. In our example, the results of the three different approaches are rather similar. In the sequel, we use the additional GLM step solution (7.33), the reason being that under this approach we can rely on one single regression model that directly predicts the claims. The other two approaches need two steps to get the predictions, which requires the storage of two models.

**Fig. 7.13** (lhs) GAM fit on the canonical scale having 2.6 effective degrees of freedom (red shows the estimated confidence bounds); (rhs) balance-corrected version using the local GAM correction

## *7.4.3 Boosting Regression Models with Network Features*

From Table 7.5 we conclude that the FN networks find systematic structure in the data that is not present in model Poisson GLM3, thus, the feature engineering for the GLM can be improved. Unfortunately, FN networks neither directly build on GLMs nor do they highlight the weaknesses of GLMs. In this section we discuss a proposal presented in Wüthrich–Merz [394] and Schelldorfer–Wüthrich [329] of combining two regression approaches. We are going to boost a GLM with FN network features. Typically, boosting is applied within the framework of regression trees. It goes back to the work of Valiant [362], Kearns–Valiant [209, 210], Schapire [328], Freund [139] and Freund–Schapire [140]. The idea behind boosting is to analyze the residuals of a given regression model with a second regression model to see whether this second regression model can still find systematic effects in the residuals which have not been discovered by the first one.

We start from the GLM studied in Chap. 5, and we boost this GLM with a FN network. Assume that both regression models act on the same feature space $\mathcal{X} \subset \{1\} \times \mathbb{R}^{q_0}$. The GLM provides a regression function for link function $g$ and GLM parameter $\boldsymbol{\beta}^{\rm GLM} \in \mathbb{R}^{q_0+1}$

$$\boldsymbol{x} \mapsto \mu^{\rm GLM}(\boldsymbol{x}) = g^{-1}\left\langle \boldsymbol{\beta}^{\rm GLM}, \boldsymbol{x}\right\rangle.$$

Recall that this GLM can be interpreted as a FN network of depth 0, see Remarks 7.2. Next, we choose a FN network of depth *d* ≥ 1 with the same link function *g* as the GLM

$$\boldsymbol{x} \mapsto \mu^{\rm FN}(\boldsymbol{x}) = g^{-1}\left\langle \boldsymbol{\beta}^{\rm FN}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\right\rangle,$$

having a network parameter $\vartheta = (\boldsymbol{w}^{(1)}_1, \ldots, \boldsymbol{w}^{(d)}_{q_d}, \boldsymbol{\beta}^{\rm FN})^\top \in \mathbb{R}^r$. In particular, we have the FN output parameter $\boldsymbol{\beta}^{\rm FN} \in \mathbb{R}^{q_d+1}$; we refer to Fig. 7.2.

We blend these two regression models by combining their regression functions

$$\boldsymbol{x} \mapsto \mu(\boldsymbol{x}) = g^{-1}\left\{\left\langle \boldsymbol{\beta}^{\rm GLM}, \boldsymbol{x}\right\rangle + \left\langle \boldsymbol{\beta}^{\rm FN}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\right\rangle\right\}, \tag{7.40}$$

with parameter $\boldsymbol{\Phi} = (\boldsymbol{\beta}^{\rm GLM}, \vartheta)^\top = (\boldsymbol{\beta}^{\rm GLM}, \boldsymbol{w}^{(1)}_1, \ldots, \boldsymbol{w}^{(d)}_{q_d}, \boldsymbol{\beta}^{\rm FN})^\top \in \mathbb{R}^{q_0+1+r}$.

An example is provided in Fig. 7.14. It shows the FN network using embedding layers for the categorical variables, see also Fig. 7.9 (rhs), and we add a GLM (in green color) that directly links the input *x* to the response variable. In machine learning this green connection is called a *skip connection* because it skips the FN layers.

*Remarks 7.20*

• Skip connections are a popular tool in network modeling, and they can be applied to any FN layers, i.e., a skip connection can, for instance, be added to skip the first FN layer. There are two benefits of skip connections. Firstly, they allow for more modeling flexibility: in (7.40) we directly combine a linear function (coming from the GLM) with a non-linear one (coming from the FN network). This has the flavor of a Taylor expansion combining terms of different orders. Secondly, skip connections can also be beneficial for gradient descent fitting because the inputs have a more direct link to the outputs, and the network only builds the functional form around the function in the skip connection.

• There are numerous variants of (7.40). A straightforward one is to choose a weight *α* ∈ *(*0*,* 1*)* and consider the regression function

$$\boldsymbol{x} \mapsto \mu(\boldsymbol{x}) = g^{-1}\left\{\alpha \left\langle \boldsymbol{\beta}^{\rm GLM}, \boldsymbol{x}\right\rangle + (1-\alpha)\left\langle \boldsymbol{\beta}^{\rm FN}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\right\rangle\right\}. \tag{7.41}$$

The weight *α* can be interpreted as the credibility assigned to the GLM.


• Another variant combines several portfolio-specific GLMs with one joint FN network. Let $\chi \in \{1, 2, 3\}$ label the insurance portfolio and consider the regression function

$$(\boldsymbol{x}, \chi) \mapsto \mu(\boldsymbol{x}, \chi) = g^{-1}\left\{\sum_{j=1}^{3}\left\langle \boldsymbol{\beta}^{\rm GLM}_j, \boldsymbol{x}\right\rangle \mathbb{1}_{\{\chi = j\}} + \left\langle \boldsymbol{\beta}^{\rm FN}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x}, \chi)\right\rangle\right\}.$$

The indicator $\mathbb{1}_{\{\chi = j\}}$ chooses the GLM that belongs to the corresponding insurance portfolio $\chi \in \{1, 2, 3\}$ with the (individual) GLM parameter $\boldsymbol{\beta}^{\rm GLM}_\chi$. The FN network term makes them related, i.e., the GLMs of the different insurance portfolios interact (jointly learn) via the FN network module. This is the approach used in Gabrielli et al. [149] to improve the chain-ladder reserving method by learning across different claims reserving triangles.

The regression function (7.40) gives the structural form of the combined regression model, but there is a second important ingredient proposed by Wüthrich–Merz [394]. Namely, the gradient descent algorithm (7.15) for model fitting can be started in an initial network parameter $\boldsymbol{\Phi}^{(0)} \in \mathbb{R}^{q_0+1+r}$ that corresponds to the MLE of the GLM. Denote by $\widehat{\boldsymbol{\beta}}^{\rm GLM}$ the MLE of the GLM part only.

Choose the initial value of the gradient descent algorithm for the fitting of the combined regression model (7.40)

$$\Phi^{(0)} = \left(\widehat{\boldsymbol{\beta}}^{\text{GLM}}, \boldsymbol{w}\_1^{(1)}, \dots, \boldsymbol{w}\_{q\_d}^{(d)}, \boldsymbol{\beta}^{\text{FN}} \equiv 0\right)^{\top} \in \mathbb{R}^{q\_0 + 1 + r},\tag{7.42}$$

that is, initially, no signals traverse the FN network part because we set $\boldsymbol{\beta}^{\rm FN} \equiv 0$.
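The effect of initialization (7.42) can be made explicit: with $\boldsymbol{\beta}^{\rm FN} \equiv 0$ the combined predictor (7.40) collapses to the GLM, so gradient descent starts exactly in the fitted GLM. A small Python illustration with a one-layer network and log-link (random weights; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
q0, q1, n = 5, 10, 8
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, q0))])   # features (1, x)

beta_glm = rng.normal(size=q0 + 1)                           # fitted GLM parameter
W1 = rng.normal(size=(q0 + 1, q1))                           # FN layer weights w^(1)
beta_fn = np.zeros(q1 + 1)                                   # initialization (7.42)

def combined_mu(x):
    z = np.concatenate([[1.0], np.tanh(x @ W1)])             # learned representation
    return np.exp(beta_glm @ x + beta_fn @ z)                # skip connection (7.40)

glm_mu = np.exp(X @ beta_glm)
# no signal traverses the FN part yet: the combined model equals the GLM
assert np.allclose([combined_mu(x) for x in X], glm_mu)
```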

#### *Remarks 7.21*


Implementation of the general combined regression model (7.40) can be a bit cumbersome, see Listing 4 in Gabrielli et al. [149], but things can substantially be simplified by declaring the GLM part in (7.40) as being non-trainable, i.e., estimating $\boldsymbol{\beta}^{\rm GLM}$ by $\widehat{\boldsymbol{\beta}}^{\rm GLM}$ in the GLM, and then freezing this parameter. In view of (7.40) this simply means that we add an offset $o_i = \langle \widehat{\boldsymbol{\beta}}^{\rm GLM}, \boldsymbol{x}_i\rangle$ to the FN network that is treated as a prior difference between the different data points; we refer to Sect. 5.2.3.

*Example 7.22 (Combined GLM and FN Network)* We revisit the French MTPL claim frequency GLM of Sect. 5.3.4, and we boost model Poisson GLM3 with FN network features. For the FN architecture we use the structure depicted in Fig. 7.14, i.e., a FN network of depth $d = 3$ having $(q_1, q_2, q_3) = (20, 15, 10)$ neurons, and using embedding layers of dimension $b = 2$ for the categorical feature components. Moreover, we declare the GLM part to be non-trainable, which allows us to use the GLM as an offset in the FN network. Finally, we apply bias regularization (7.33) to receive the balance property.

The results are presented in Table 7.7. A first observation is that using model Poisson GLM3 as an offset reduces the run time of gradient descent fitting because we start the algorithm already in a reasonable model. Secondly, as expected, the FN features decrease the loss of model Poisson GLM3; this indicates that there are systematic effects that are not captured by the GLM. The final combined and regularized model has roughly the same out-of-sample loss as the corresponding FN network, showing that this approach can be beneficial in run times, while the predictive power is similar to a pure FN network.

**Table 7.7** Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in $10^{-2}$) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network model (with embedding layers of dimension $b = 2$), and the combined regression model GLM3+FN, see (7.40)

*Example 7.23 (Improving Model Poisson GLM3)* In this example we would like to explore the deficiencies of model Poisson GLM3 by boosting it with FN network features. We do this in a systematic way by only considering two (continuous) feature components at a time in the FN network. That is, we consider the combined approach (7.40) with initialization (7.42), but as feature information for the network part, we only consider two components at a time. For instance, we start with the features $(1, \text{Area}, \text{VehPower}) \in \{1\} \times \mathbb{R}^2$ for the network part, and the remaining feature information is ignored in this step. This way we can test whether the marginal modeling of Area and VehPower is suitable in model Poisson GLM3, and whether a pairwise interaction in these two components is missing. We train this FN network starting from model Poisson GLM3 (and keeping this GLM part frozen). The decrease in the out-of-sample loss during the gradient descent training is shown in Fig. 7.15 (top-left). We observe that the loss remains rather constant over 100 training epochs. This tells us that the pair (Area, VehPower) is appropriately considered in model Poisson GLM3.

Figure 7.15 gives all pairwise plots of the continuous feature components Area, VehPower, VehAge, DrivAge, BonusMalus, Density; the scale on the *y*-axis is identical in all plots. We observe that only the plots including the variable BonusMalus provide a larger decrease in loss (in blue color in the colored version). This indicates that mainly this feature component is not modeled optimally in model Poisson GLM3, because boosting with a FN network finds systematic structure here that improves the loss of model Poisson GLM3. In model Poisson GLM3, the variable BonusMalus has been modeled log-linearly with interaction terms with DrivAge and $(\text{DrivAge})^2$, see (5.35). Table 7.8 shows the result if we add a FN network feature (7.40) for the pair (DrivAge, BonusMalus) to model Poisson GLM3. Indeed, we see that the resulting combined GLM-FN network model has the same GL as the full FN network approach. Thus, we conclude that model Poisson GLM3 performs fairly well and only the modeling of the pair (DrivAge, BonusMalus) should be improved. -

## *7.4.4 Network Ensemble Learning*

Ensemble learning is a popular way of expressing that one takes an average over different predictors. There are many established methods that belong to the family of ensemble learning, e.g., **b**ootstrap **agg**regat**ing** (called *bagging*) introduced by Breiman [51], random forests, and boosting. Random forests

**Fig. 7.15** Exploring all pairwise interactions: out-of-sample losses over 100 gradient descent epochs for all pairs of the continuous feature components Area, VehPower, VehAge, DrivAge, BonusMalus, Density (the scale on the *y*-axis is identical in all plots)

and boosting are mainly based on classification and regression trees (CARTs), and they are among the most powerful machine learning methods for tabular data. These methods combine a family of predictors into a more powerful predictor. The present section is inspired by the bagging method of Breiman [51], and we perform **n**etwork **agg**regat**ing** (called *nagging*).

#### **Stochastic Gradient Descent Fitting of Networks**

We have described that network calibration involves several elements of randomness. This, in combination with early stopping, leads to the non-uniqueness of reasonably good networks for prediction and pricing. We have discussed this based

**Table 7.8** Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10−2) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5, model Poisson GLM3 with additional FN features for (DrivAge, BonusMalus), the FN network model (with embedding layers of dimension *b* = 2), and the combined regression model GLM3+FN, see (7.40)


on Fig. 7.5, namely, for a given network architecture we have a continuum of comparably good models (w.r.t. the chosen objective function) that lie in the green area of Fig. 7.5. One SGD calibration picks one specific model from this green area; we also refer to Remarks 7.9. Of course, this is very unsatisfactory in insurance pricing because it implies that the selection of a price for an insurance policy has a substantial element of subjectivity (that cannot be explained to the customer). Naturally, we would like to combine models in the green area of Fig. 7.5, for instance, by performing some sort of integration over the models in the green area. Intuitively, this should lead to a very powerful predictive model because it diversifies the weaknesses of each individual model. This is exactly what we discuss in this section. Before doing so, we would first like to understand the different single calibrations of a given network architecture.

We consider the MTPL data of Example 7.12. We model this data with a Poisson FN network using embedding layers for the categorical features and using bias regularization (7.33) to guarantee that the balance property holds. For the FN network architecture we choose depth *d* = 3 with $(q\_1, q\_2, q\_3) = (20, 15, 10)$ FN neurons; this setup gives us the results on the last line of Table 7.5. We now repeat this procedure *M* = 1 600 times, using exactly the same FN network architecture, the same early stopping strategy, the same SGD method and the same batch size. We only change the seeds of the starting point $\vartheta^{(0)} \in \mathbb{R}^r$ of the SGD algorithm, the partitioning of the learning data *L* into training data *U* and validation data *V*, see Fig. 7.7, and the partitioning of the training data into the (mini-)batches.

The resulting 1 600 in-sample and out-of-sample deviance losses are presented in Fig. 7.16. We observe a considerable variation in these figures. The in-sample losses vary between 23.616 and 23.815 (mean 23.728), and the corresponding out-of-sample losses between 23.766 and 23.899 (mean 23.819); units are in 10−2. Note that all network calibrations are bias regularized. The in-sample loss is an average over *n* = 610 206 (individual) unit deviance losses, and the out-of-sample loss is an average over *T* = 67 801 unit deviance losses, see also Definition 4.24. Therefore, we expect an even bigger variation on individual insurance policies. We are going to analyze this in more detail in this section.

**Fig. 7.16** Boxplots over 1 600 network calibrations only differing in the seeds for the SGD algorithm and the partitioning of the learning data: (lhs) in-sample losses on *L* and (rhs) out-of-sample losses on *T*, the horizontal lines show the calibration chosen in Table 7.5; units are in 10−2

Before doing so, we would like to understand whether there is some dependence between the in-sample and the out-of-sample losses over the *M* = 1 600 runs of the SGD algorithm with different seeds. In Fig. 7.17 we provide a scatter plot of the out-of-sample losses vs. the in-sample losses. This plot is complemented by a cubic spline regression (in orange color). From this plot we conclude that the models with very small in-sample losses tend to over-fit, and the models with large in-sample losses tend to under-fit (always using the same early stopping rule). In view of these results we conclude that the chosen early stopping rule is sensible because on average it tends to provide the model with the smallest out-of-sample loss on *T* . Recall that we do not use *T* during the SGD fitting, but only the learning data *L* that is split into the training data *U* and the validation data *V* for exercising the early stopping, see Fig. 7.7.

Next, we study the estimated prices on the test data (out-of-sample)

$$\mathcal{T} = \left\{ (Y\_t^\dagger = N\_t^\dagger / v\_t^\dagger, \mathbf{x}\_t^\dagger, v\_t^\dagger) \, : \, t = 1, \dots, T = 67'801 \right\}.$$

For each run of the SGD algorithm we receive a different (early stopped) network parameter estimate $\widehat{\vartheta}^m \in \mathbb{R}^r$, $1 \le m \le M = 1\,600$. Using these parameter estimates we receive the estimated network regression functions, for $1 \le m \le M$,

$$\mathbf{x} \mapsto \widehat{\mu}^m(\mathbf{x}) = \mu\_{\widehat{\vartheta}^m}(\mathbf{x}),$$

using the FN network of Listing 7.4 with network parameter $\widehat{\vartheta}^m$. Thus, for the out-of-sample policies $1 \le t \le T$ we receive the expected frequencies

$$\mathbf{x}\_t^\dagger \mapsto \ \widehat{\mu}\_t^{m} = \widehat{\mu}^{m}\left(\mathbf{x}\_t^\dagger\right) = \mu\_{\widehat{\vartheta}^m}\left(\mathbf{x}\_t^\dagger\right).$$

Since we choose the seeds of the SGD runs *at random* we may (and will) assume that we have independence between the prices $(\widehat{\mu}\_t^m)\_{1 \le t \le T}$ of the different runs $1 \le m \le M$ of the SGD algorithm. This allows us to estimate, for a fixed insurance policy *t*, the average price and the coefficient of variation of these prices over the different SGD runs

$$\bar{\mu}\_t^{(1:M)} = \frac{1}{M} \sum\_{m=1}^{M} \widehat{\mu}\_t^{m} \quad \text{and} \quad \mathrm{Vco}\_t = \frac{1}{\bar{\mu}\_t^{(1:M)}} \sqrt{\frac{1}{M-1} \sum\_{m=1}^{M} \left(\widehat{\mu}\_t^{m} - \bar{\mu}\_t^{(1:M)}\right)^{2}}. \tag{7.43}$$

These (out-of-sample) coefficients of variation are illustrated in Fig. 7.18. We observe a considerable variation on some policies. The average coefficient of variation is roughly 10% (orange horizontal line, lhs). The maximal coefficient of variation is about 40%; thus, for this policy the individual prices $\widehat{\mu}\_t^m$ of the different SGD runs $1 \le m \le M$ fluctuate considerably around $\bar{\mu}\_t^{(1:M)}$. This also explains why we choose *M* = 1 600 SGD runs: the averaging in (7.43) reduces the coefficient of variation on this policy to $40\%/\sqrt{M} = 40\%/40 = 1\%$, using that we have independence between the different SGD runs. Thus, by averaging we keep the influence of the variation of the individual SGD fittings at an acceptable level.
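The computation in (7.43) and the $1/\sqrt{M}$ noise reduction can be sketched as follows; this is a minimal numpy illustration on simulated predictions (the multiplicative noise model and all dimensions are illustrative assumptions, not the MTPL results):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical M x T matrix of estimated frequencies: M SGD runs, T policies;
# the run-to-run variation is mimicked by i.i.d. multiplicative noise.
M, T = 1600, 500
mu_true = rng.uniform(0.05, 0.2, size=T)
mu_hat = mu_true * rng.lognormal(mean=-0.005, sigma=0.1, size=(M, T))

# average price and coefficient of variation per policy, see (7.43)
mu_bar = mu_hat.mean(axis=0)
vco = mu_hat.std(axis=0, ddof=1) / mu_bar

# averaging K runs shrinks the run-to-run noise roughly by a factor 1/sqrt(K):
# compare single-run Vco's to the Vco's of averages over disjoint groups of K runs
K = 16
group_means = mu_hat.reshape(M // K, K, T).mean(axis=1)
vco_group = group_means.std(axis=0, ddof=1) / group_means.mean(axis=0)
ratio = (vco / vco_group).mean()   # close to sqrt(K) = 4
```

The group comparison makes the square-root law visible empirically: averaging over $K$ independent runs divides the coefficient of variation by roughly $\sqrt{K}$, which is the argument used above for $M = 1\,600$.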

Listing 7.9 shows the 10 policies (out-of-sample) with the largest coefficients of variation Vco*t*. These policies have in common that they belong to the lowest BonusMalus level, the drivers are very young, the cars are comparably old and they have a higher vehicle power. From a practical point of view we should doubt these policies, since the information provided may not be correct. New drivers (at the age of 18) typically enter a bonus-malus scheme at level 100, and only after several accident-free years can these drivers reach a bonus-malus level of 50. Thus, policies as in Listing 7.9 should not exist, and our pricing framework has difficulties in handling them correctly. In practice, this needs further investigation because, obviously, there is a data issue here.

**Fig. 7.18** Out-of-sample coefficients of variation Vco*t* on an individual policy level $1 \le t \le T$ over the 1 600 calibrations: (lhs) scatter plot against the average estimated frequencies $\bar{\mu}\_t^{(1:M)}$ and (rhs) resulting histogram


**Listing 7.9** The 10 policies (out-of-sample) with the largest coefficients of variation

#### **Nagging Predictor**

The previously observed variations of the prices motivate averaging over the different models (network calibrations). This brings us to bagging introduced by Breiman [51]. Bagging is based on averaging/aggregating over several 'independent' predictions; this is done in three steps. In a first step, a model is fitted to the data *L*. In a second step, independent bootstrap samples $\mathcal{L}^{*(m)}$ are generated from this fitted model; the independence has to be understood in a conditional sense, namely, the different bootstrap samples $\mathcal{L}^{*(m)}$ are independent in *m*, given the data *L*. In the third step, for every bootstrap sample $\mathcal{L}^{*(m)}$ one estimates a model $\widehat{\mu}^m$, and averaging (7.43) provides the bagging predictor. Bagging is mainly a *variance reduction* technique. Note that if the fitted model of the first step has a bias, then likely the bootstrap samples $\mathcal{L}^{*(m)}$ are biased, and so is the bagging predictor. Therefore, bagging does not help to reduce a potential bias. All these results have to be understood conditionally on the data *L*. If this data is atypical for the problem, so will the bootstrap samples be.

We can perform a similar analysis for the fitted networks, but we do not need to bootstrap here, because the various elements of randomness in SGD fitting allow us to generate independent predictors $\widehat{\mu}^m$, conditional on the data *L*. Averaging (7.43) over these predictors then provides us with the **n**etwork **agg**regat**ing** (nagging) predictor $\bar{\mu}^{(1:M)}$; we also refer to Dietterich [105] and Richman–Wüthrich [315] for this aggregation. Thus, we replace the bootstrap step by the different runs of the SGD algorithm. Both options provide independent predictors $\widehat{\mu}^m$, conditional on the data *L*. However, there is a fundamental difference between bagging and nagging. Bagging generates new (bootstrap) samples $\mathcal{L}^{*(m)}$ and, thus, bagging also involves randomness coming from sampling the new observations. Nagging always acts on the same sample *L*, and it only refits the model multiple times. Therefore, the latter will typically introduce less variation. Of course, bagging and nagging can be combined, and then the full expected GL can be estimated; we come back to this in Sect. 11.4, below. We do not sample new observations here, because we would like to understand the variations implied by the SGD algorithm with early stopping on the given (fixed) data.

In Fig. 7.18 we have seen that we need nagging over 1 600 network calibrations so that the maximal coefficient of variation on an individual policy level is below 1% in our MTPL example. In this section we would like to understand the minimal out-of-sample loss that can be achieved by nagging on the (entire) test data set, and we would like to analyze its rate of convergence.

For this we define the sequence of nagging predictors

$$
\bar{\mu}^{(1:M)}(\mathbf{x}) = \frac{1}{M} \sum\_{m=1}^{M} \widehat{\mu}^{m}(\mathbf{x}) \qquad \text{ for } M \ge 1. \tag{7.44}
$$

This allows us to study the out-of-sample losses on *T* in the Poisson model for *M* ≥ 1

$$\mathfrak{D}(\mathcal{T}, \bar{\mu}^{(1:M)}) = \frac{2}{T} \sum\_{t=1}^{T} v\_t^\dagger \left( \bar{\mu}^{(1:M)}(\mathbf{x}\_t^\dagger) - Y\_t^\dagger - Y\_t^\dagger \log \left( \frac{\bar{\mu}^{(1:M)}(\mathbf{x}\_t^\dagger)}{Y\_t^\dagger} \right) \right).$$
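Evaluating this out-of-sample loss for a growing number of aggregated networks can be sketched as follows; this is a minimal numpy illustration on simulated data (the portfolio, the true frequencies and the noise in the individual predictors are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical out-of-sample data and M prediction vectors (one per SGD run).
T, M = 5000, 50
v = rng.uniform(0.5, 1.0, size=T)                        # exposures
mu_true = rng.uniform(0.05, 0.2, size=T)                 # true frequencies
Y = rng.poisson(v * mu_true) / v                         # observed frequencies Y = N / v
mu_hat = mu_true * rng.lognormal(0.0, 0.5, size=(M, T))  # noisy individual predictors

def poisson_deviance(Y, mu, v):
    # D(T, mu) as in the display above, with the convention 0 * log(0) = 0
    t = Y * np.log(np.where(Y > 0, mu / Y, 1.0))
    return 2.0 * np.mean(v * (mu - Y - t))

# out-of-sample loss of the nagging predictor (7.44) as M grows
losses = [poisson_deviance(Y, mu_hat[:m].mean(axis=0), v) for m in range(1, M + 1)]
```

As in Fig. 7.19, the curve `losses` typically drops steeply over the first few aggregation steps and then flattens, since the run-to-run noise in the individual predictors is averaged out.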

*Remark 7.24* From Remarks 7.17 we know that the expected deviance GL of the estimated model is lower bounded by the expected deviance GL of the true data generating model; the difference is the conditional calibration. Within the family of Tweedie's CP models, Richman–Wüthrich [315] proved that, indeed, aggregating monotonically decreases the expected deviance GL of the estimated model (Proposition 2 of [315]), convergence is established (Proposition 3 of [315]), and the speed of convergence is provided using asymptotic normality (Proposition 4 of [315]). For the Gaussian square loss results we refer to Breiman [51] and Bühlmann–Yu [60].

We revisit Proposition 2 of Richman–Wüthrich [315], which has also been proved in Proposition 3.1 of Denuit–Trufin [103]. We only consider a single case in the next proposition, and we drop the feature information $\mathbf{x}$ (because we can condition on $\mathbf{X} = \mathbf{x}$).

**Proposition 7.25** *Choose a response $Y \sim f(\cdot; \theta, v/\varphi)$ belonging to Tweedie's CP model having a power variance cumulant function $\kappa = \kappa\_p$ with power variance parameter $p \in [1, 2]$, see (2.17). Assume $\widehat{\mu}$ is an estimator for the mean parameter $\mu = \kappa\_p'(\theta) > 0$ satisfying $\epsilon < \widehat{\mu} \le p/(p-1)\,\mu$, a.s., for some $\epsilon \in (0, p/(p-1)\,\mu)$. Choose i.i.d. copies $\widehat{\mu}^m$, $m \ge 1$, of $\widehat{\mu}$, all being independent of $Y$. We have for all $M \ge 1$*

$$\mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y, \widehat{\mu}^{1}\right)\right] \geq \mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M)}\right)\right] \geq \mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right] \geq \mathbb{E}\_{\theta}\left[\mathfrak{d}(Y, \mu)\right].$$

*Proof of Proposition 7.25* The lower bound on the right-hand side immediately follows from Theorem 4.19. For an estimate $\widehat{\mu} > 0$ we define the following function, see also (4.18), where $h\_p = (\kappa\_p')^{-1}$ denotes the canonical link,

$$
\widehat{\mu} \mapsto \psi\_p(\widehat{\mu}) = \mu h\_p(\widehat{\mu}) - \kappa\_p \left( h\_p(\widehat{\mu}) \right) = \begin{cases}
\mu \log(\widehat{\mu}) - \widehat{\mu} & \text{for } p = 1, \\
\mu \frac{\widehat{\mu}^{1-p}}{1-p} - \frac{\widehat{\mu}^{2-p}}{2-p} & \text{for } p \in (1, 2), \\
-\mu/\widehat{\mu} - \log(\widehat{\mu}) & \text{for } p = 2.
\end{cases}
$$

This is the part of the log-likelihood (and deviance loss) that depends on the canonical parameter $\widehat{\theta} = h\_p(\widehat{\mu})$, with the observation $Y$ replaced by the mean $\mu$. Calculating the second derivative w.r.t. $\widehat{\mu}$ provides for $p \in [1, 2]$

$$\frac{\partial^2}{\partial \widehat{\mu}^2} \psi\_p(\widehat{\mu}) = -p\mu \widehat{\mu}^{-p-1} - (1-p)\widehat{\mu}^{-p} = \widehat{\mu}^{-(1+p)} \left[ -p\mu - (1-p)\widehat{\mu} \right] \le 0,$$

the last inequality uses that the square bracket is non-positive, a.s., under our assumptions on $\widehat{\mu}$. Thus, $\psi\_p$ is concave on the interval $(0, p/(p-1)\,\mu)$. We now focus on the inequalities for $M \ge 1$. Consider the decomposition of the nagging predictor for $M + 1$

$$
\bar{\mu}^{(1:M+1)} = \frac{1}{M+1} \sum\_{j=1}^{M+1} \bar{\mu}^{(-j)}, \qquad \text{where} \qquad \bar{\mu}^{(-j)} = \frac{1}{M} \sum\_{m=1}^{M+1} \hat{\mu}^m \mathbb{1}\_{\{m \neq j\}}.
$$

The predictors $\bar{\mu}^{(-j)}$, $j \ge 1$, are copies of $\bar{\mu}^{(1:M)}$, though not independent ones. Using the function $\psi\_p$, the second term on the right-hand side has the same structure as the estimation risk function (4.14),

$$
\begin{aligned}
\mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M)}\right)\right]
&= \mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right]
+ 2\, \mathbb{E}\_{\theta}\left[ Y h\_p\left(\bar{\mu}^{(1:M+1)}\right) - \kappa\_p\left(h\_p\left(\bar{\mu}^{(1:M+1)}\right)\right) \right] \\
&\qquad - 2\, \mathbb{E}\_{\theta}\left[ Y h\_p\left(\bar{\mu}^{(1:M)}\right) - \kappa\_p\left(h\_p\left(\bar{\mu}^{(1:M)}\right)\right) \right] \\
&= \mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right]
+ 2 \left( \mathbb{E}\left[\psi\_p\left(\bar{\mu}^{(1:M+1)}\right)\right] - \mathbb{E}\left[\psi\_p\left(\bar{\mu}^{(1:M)}\right)\right] \right) \\
&= \mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right]
+ 2 \left( \mathbb{E}\left[\psi\_p\left(\frac{1}{M+1} \sum\_{j=1}^{M+1} \bar{\mu}^{(-j)}\right)\right] - \mathbb{E}\left[\psi\_p\left(\bar{\mu}^{(1:M)}\right)\right] \right) \\
&\ge \mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right]
+ 2 \left( \mathbb{E}\left[\frac{1}{M+1} \sum\_{j=1}^{M+1} \psi\_p\left(\bar{\mu}^{(-j)}\right)\right] - \mathbb{E}\left[\psi\_p\left(\bar{\mu}^{(1:M)}\right)\right] \right) \\
&= \mathbb{E}\_{\theta}\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right],
\end{aligned}
$$

where the second-to-last step applies Jensen's inequality to the concave function $\psi\_p$, and the last step follows from the fact that the $\bar{\mu}^{(-j)}$, $j \ge 1$, are copies of $\bar{\mu}^{(1:M)}$.
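The monotonicity of Proposition 7.25 can be checked by Monte Carlo simulation in the Poisson case $p = 1$ (where the upper bound $p/(p-1)\,\mu$ is vacuous); the lognormal distribution chosen for the estimator below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

# Poisson case p = 1: unit deviance d(Y, mu) = 2 (mu - Y - Y log(mu / Y)).
def unit_deviance(Y, mu):
    t = Y * np.log(np.where(Y > 0, Y / mu, 1.0))   # convention 0 * log(0) = 0
    return 2.0 * (mu - Y + t)

n_sim, mu = 400_000, 1.0
Y = rng.poisson(mu, size=n_sim)

# i.i.d. copies of a noisy, unbiased estimator of mu, independent of Y
def mu_hat(m):
    return mu * rng.lognormal(-0.125, 0.5, size=(m, n_sim))  # E[mu_hat] = mu

# expected deviance of the nagging predictor for increasing M, and for the true mu
E_d = {m: unit_deviance(Y, mu_hat(m).mean(axis=0)).mean() for m in (1, 4, 16)}
E_d_true = unit_deviance(Y, np.full(n_sim, mu)).mean()
```

With these settings the estimated expected deviances decrease in $M$ and stay above the value of the true mean, mirroring the chain of inequalities in the proposition.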

#### *Remarks 7.26*


• If additionally we have unbiasedness of $\widehat{\mu}$ for $\mu$ and a uniformly integrable upper bound on $\bar{\mu}^{(1:M)}$, we can use Lebesgue's dominated convergence theorem and the law of large numbers to prove

$$\lim\_{M \to \infty} \mathbb{E}\_{\theta} \left[ \mathfrak{d} \left( Y, \bar{\mu}^{(1:M)} \right) \right] = \mathbb{E}\_{\theta} \left[ \lim\_{M \to \infty} \mathfrak{d} \left( Y, \bar{\mu}^{(1:M)} \right) \right] = \mathbb{E}\_{\theta} \left[ \mathfrak{d}(Y, \mu) \right]. \tag{7.45}$$

The uniformly integrable upper bound is only needed in the Poisson case $p = 1$, because the other cases are covered by $\epsilon < \widehat{\mu} \le p/(p-1)\,\mu$, a.s. Moreover, asymptotic normality can be established; we refer to Proposition 4 in Richman–Wüthrich [315].

We come back to our MTPL Poisson claim frequency example and its 1 600 network calibrations illustrated in Fig. 7.17. Figure 7.19 provides the out-of-sample portfolio losses $\mathfrak{D}(\mathcal{T}, \bar{\mu}^{(1:M)})$ of the resulting nagging predictors $(\bar{\mu}^{(1:M)}(\mathbf{x}\_t^\dagger))\_{1 \le t \le T}$ for $1 \le M \le 40$ in red color, and the corresponding one standard deviation confidence bounds in orange color. The blue horizontal dotted line shows the case $M = 1$ which exactly refers to the (first) bias regularized FN network $\widehat{\mu}^{m=1}$ with embedding layers given in Table 7.5. Indeed, averaging over multiple networks improves the predictive model, and the out-of-sample loss decreases over the first $2 \le M \le 10$ nagging steps. After the first 10 steps the picture starts to stabilize, which indicates that for this size of portfolio (and this type of problem) we need to average over roughly 10–20 FN networks to receive optimal predictive models on the portfolio level. For $M \to \infty$ the out-of-sample loss converges to the green horizontal dotted line in Fig. 7.19 of $23.783 \cdot 10^{-2}$. These numbers are also reported on the last line of Table 7.9.

Figure 7.20 provides the empirical auto-calibration property (7.39) of the nagging predictor $\bar{\mu}^{(1:1600)}$; this is obtained completely analogously to Fig. 7.12.

**Table 7.9** Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10−2) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network models (with embedding layers of dimension *b* = 2), and the nagging predictor for *M* = 1 600


**Fig. 7.20** Empirical auto-calibration (7.39) of the Poisson nagging predictor, the blue line shows the empirical density of $v\_i \bar{\mu}^{(1:1600)}(\mathbf{x}\_i)$, $1 \le i \le n$

The nagging predictors are (already) bias regularized, and Fig. 7.20 supports that the auto-calibration property holds rather accurately.

At this stage, we have fully arrived at Breiman's [53] two modeling cultures dilemma, see also Sect. 1.1. We have started from a parametric data model, and in order to boost its predictive performance we have combined such models in an algorithmic way. Working with many blended networks is not really practical; therefore, in such situations, a meta model can be fitted to the resulting nagging predictor.

#### **Meta Model**

Since working with *M* = 1 600 different FN networks is not practical, we fit a meta model to the nagging predictor $\bar{\mu}^{(1:M)}(\cdot)$. This can easily be done by selecting an additional FN network and fitting it to the working data

$$\mathcal{D}^{\*} = \left\{ \left( \bar{\mu}^{(1:M)}(\mathbf{x}\_i), \mathbf{x}\_i, v\_i \right) : i = 1, \dots, n \right\} \cup \left\{ \left( \bar{\mu}^{(1:M)}(\mathbf{x}\_t^\dagger), \mathbf{x}\_t^\dagger, v\_t^\dagger \right) : t = 1, \dots, T \right\}.$$

**Table 7.10** Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10−2) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network model (with embedding layers of dimension *b* = 2), the nagging predictor, and the meta network model


For this calibration step we can consider all data, since we would like to fit a regression model as accurately as possible to the entire regression surface formed by all nagging predictors from the learning and the test data sets *L* and *T*. Moreover, this step should not over-fit since this regression surface of nagging predictors does not include any noise, but it is on the level of expected values. As network architecture we choose again the same FN network of depth *d* = 3. The only change to the fitting procedure above is replacing the Poisson deviance loss by the square loss function, since we do not work with the Poisson responses $N\_i$ but rather with their mean estimates $\bar{\mu}^{(1:M)}(\mathbf{x}\_i)$ and $\bar{\mu}^{(1:M)}(\mathbf{x}\_t^\dagger)$ in this fitting step. Since the resulting meta network model may still have a bias, we apply the bias regularization step of Listing 7.7 to the Poisson observations with the Poisson deviance loss on the learning data *L* (only). The results are presented in Table 7.10.
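The square-loss meta fit can be sketched as follows; this is a minimal numpy illustration (a one-hidden-layer network trained by plain gradient descent; the architecture, learning rate and the stand-in nagging surface are illustrative assumptions, not the keras setup used in the book):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical working data D*: features and a noiseless nagging surface mu_bar.
n, q0 = 3000, 7
X = rng.normal(size=(n, q0))
mu_bar = np.exp(-2.0 + 0.3 * (X @ rng.normal(size=q0)))   # stand-in for the nagging predictor

# meta network: one hidden layer, log-link output, fitted with the square loss
q = 10
W1 = rng.normal(0.0, 0.3, size=(q0, q)); b1 = np.zeros(q)
w2 = rng.normal(0.0, 0.3, size=q); b2 = np.log(mu_bar.mean())

def predict(X):
    Z = np.tanh(X @ W1 + b1)
    return Z, np.exp(Z @ w2 + b2)

mse_init = np.mean((predict(X)[1] - mu_bar) ** 2)

eta = 0.2
for _ in range(2000):
    Z, mu = predict(X)
    g = 2.0 * (mu - mu_bar) * mu / n          # gradient of the MSE w.r.t. the linear predictor
    w2 -= eta * Z.T @ g; b2 -= eta * g.sum()
    dZ = np.outer(g, w2) * (1.0 - Z ** 2)
    W1 -= eta * X.T @ dZ; b1 -= eta * dZ.sum(axis=0)

mse_meta = np.mean((predict(X)[1] - mu_bar) ** 2)
```

Since the targets are noiseless expected values, over-fitting is not a concern in this step; the only goal is to approximate the nagging surface as closely as possible.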

From these results we observe that in our case the meta network performs similarly well to the nagging predictor, and it seems to be a very reasonable choice.

Finally, in Fig. 7.21 (lhs) we analyze the resulting frequencies on an individual policy level on the test data set *T*. We plot the estimated frequencies $\widehat{\mu}^{m=1}(\mathbf{x}\_t^\dagger)$ of the first FN network (this corresponds to 'embed FN bias regularized' in Table 7.10 with an out-of-sample loss of 23.824) against the nagging predictor $\bar{\mu}^{(1:M)}(\mathbf{x}\_t^\dagger)$ which averages over *M* = 1 600 networks. From Fig. 7.21 (lhs) we conclude that there are quite some differences between these two predictors; this exactly reflects the variations obtained in Fig. 7.18 (lhs). The nagging predictor removes this variation by averaging. Figure 7.21 (rhs) compares the nagging predictor $\bar{\mu}^{(1:M)}(\mathbf{x}\_t^\dagger)$ to the one of the meta model $\widehat{\mu}^{\text{meta}}(\mathbf{x}\_t^\dagger)$. This scatter plot shows that the predictors lie almost perfectly on the diagonal line, which suggests that the meta model can be used as a substitute for the nagging predictor. This completes this claim frequency modeling example.

*Remark 7.27* The meta model concept can also be useful in other situations. For instance, we can fit a gradient boosting regression model to the observations. Typically, this is much faster than calculating a nagging predictor (because it directly focuses on the weaknesses of the existing model). If the gradient boosting model is based on regression trees, it has the disadvantage that the resulting regression

**Fig. 7.21** Scatter plots of the out-of-sample predictions $\widehat{\mu}^{m=1}(\mathbf{x}\_t^\dagger)$, $\bar{\mu}^{(1:M)}(\mathbf{x}\_t^\dagger)$ and $\widehat{\mu}^{\text{meta}}(\mathbf{x}\_t^\dagger)$ over all policies $1 \le t \le T$ on the test data set *T*: (lhs) $\widehat{\mu}^{m=1}(\mathbf{x}\_t^\dagger)$ vs. $\bar{\mu}^{(1:M)}(\mathbf{x}\_t^\dagger)$ and (rhs) $\widehat{\mu}^{\text{meta}}(\mathbf{x}\_t^\dagger)$ vs. $\bar{\mu}^{(1:M)}(\mathbf{x}\_t^\dagger)$; the color scale shows the exposures $v\_t^\dagger \in (0, 1]$

function is not continuous, and a non-constant extrapolation might be an issue. In a second step we can fit a meta FN network model to the former regression model, lifting the boosting model to a smooth network that allows for a non-constant extrapolation.

*Example 7.28 (Gamma Claim Size Modeling)* We revisit the gamma claim size example of Sect. 5.3.7. The data comprises Swedish motorcycle claim amounts. We have seen that this claim size data is not heavy-tailed; thus, a gamma distribution may be a reasonable choice for this data. For the modeling of this data we use the same normalization as in (5.45); this parametrization does not require the explicit knowledge of the (constant) shape parameter of the gamma distribution for mean estimation.

The difficulty with this data is that only 656 insurance policies suffer a claim, and likely a single FN network will not lead to stable results in this example. As FN network architecture we again choose a network of depth *d* = 3 with $(q\_1, q\_2, q\_3) = (20, 15, 10)$ neurons. Since the input layer has dimension $q\_0 = 1 + 6 = 7$ we receive a network parameter of dimension *r* = 626. As loss function we choose the gamma deviance loss, see Table 4.1. Moreover, we choose the nadam optimizer, a batch size of 300, a training-validation split of 8:2, and we retrieve the network calibration with the lowest validation loss using a callback.

Figure 7.22 shows the results of 1 000 different SGD runs (only differing in the initial seeds, the splits of the training-validation sets and the batches). We see a considerable variation between the different SGD runs, both in the in-sample deviance losses and in the average estimated claims. Note that we did not bias-regularize the resulting networks (we work with the log-link here, which is not the canonical one). This is why we receive fluctuating portfolio averages in Fig. 7.22

**Fig. 7.22** Boxplots over 1 000 network calibrations only differing in the seeds for the SGD algorithm and the partitioning of the learning-validation data: (lhs) in-sample losses on the (entire) data *L* and (rhs) average estimated claims

**Fig. 7.23** Coefficients of variation Vco*i* on an individual claim level $1 \le i \le n$ over the 1 000 calibrations: (lhs) scatter plot against the nagging predictor $\bar{\mu}^{(1:M)}(\mathbf{x}\_i)$ and (rhs) histogram

(rhs), the red line illustrates the empirical mean. Obviously, these FN networks are (on average) positively biased, and they will need a bias correction for the final prediction.

Figure 7.23 analyzes the variations on an individual claim level by studying the in-sample version of the coefficient of variation given in (7.43). We see that these coefficients of variation are bigger than in the claim frequency example, see Fig. 7.18. Thus, to receive stable results the nagging predictors $\bar{\mu}^{(1:M)}(\mathbf{x}\_i)$ have to be calculated over many networks. Figure 7.24 confirms that aggregating reduces (in-sample) losses also in this case. From this figure we also see that the convergence is slower compared to the MTPL frequency example of Fig. 7.19; of course, this is because we have a much smaller claims portfolio.

**Table 7.11** Number of parameters, Pearson's dispersion estimate, MLE dispersion estimate, insample losses and in-sample average claim amounts of the null model (gamma intercept model), the gamma GLMs and the network nagging predictor; for the GLMs we refer to Table 5.13


Table 7.11 presents the results if we take the nagging predictor over 1 000 different networks. The first observation is that we receive a much smaller in-sample loss compared to the GLMs; thus, there seems to be much room for improvement in the GLMs. Secondly, the nagging predictor has a substantial bias. For this reason we shift the intercept parameter in the output layer so that the portfolio average of the nagging predictor is equal to the empirical mean, see the last column of Table 7.11.
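The intercept shift works because, under a log-link, adding $\delta$ to the output intercept multiplies every prediction by $e^\delta$; choosing $e^\delta$ as the ratio of the empirical to the predicted (weighted) portfolio mean restores the balance. A small sketch with illustrative numbers (the portfolio, claim amounts and bias level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical portfolio: weights v, observed claim amounts y, biased predictions.
n = 1000
v = rng.uniform(0.5, 1.0, size=n)
y = rng.gamma(shape=2.0, scale=500.0, size=n)          # observed average claim amounts
mu_hat = 1.15 * y.mean() * rng.uniform(0.8, 1.2, n)    # predictions with an upward bias

# with a log-link, shifting the output intercept by delta multiplies all
# predictions by exp(delta); choose delta to restore the balance property
delta = np.log(np.sum(v * y) / np.sum(v * mu_hat))
mu_corr = np.exp(delta) * mu_hat

# the weighted portfolio average of mu_corr now matches the empirical mean exactly
```

This correction only removes the portfolio-level bias; the individual ranking of the predictions is unchanged, since every prediction is scaled by the same factor.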

A main difficulty in this model is the estimation of the dispersion parameter $\varphi > 0$ and the shape parameter $\alpha = 1/\varphi$ of the gamma distribution, respectively. Pearson's dispersion estimate does not work because we do not know the degrees of freedom of the nagging predictor, see also (5.49). In Table 7.11 we calculate Pearson's dispersion estimate by simply dividing by the number of observations; this should be understood as a lower bound, and this number is highlighted in italics. Alternatively, we can calculate the MLE; however, this may be rather different from Pearson's estimate, as indicated in Table 7.11. Figure 7.25 (lhs) shows the resulting QQ plot of the nagging predictor if we use the MLE $\widehat{\varphi}^{\mathrm{MLE}} = 1.240$, and the right-hand side shows the same plot for $\widehat{\varphi} = 1.050$. From these plots it seems that we should rather go for a smaller dispersion parameter, the MLE probably being too much dominated by the small claims. This observation should also be understood as a red flag, as it tells us that the chosen gamma model is not fully suitable. This may

**Fig. 7.25** QQ plots of the nagging predictors against the gamma density with (lhs) $\widehat{\varphi}^{\mathrm{MLE}} = 1.240$ and (rhs) $\widehat{\varphi} = 1.050$

**Fig. 7.26** (lhs) Scatter plot of model Gamma GLM2 predictors against the nagging predictors $\bar{\mu}^{(1:M)}(\mathbf{x}\_i)$ over all instances $1 \le i \le n$, (rhs) scatter plot of two (independent) nagging predictors

be for various reasons: (1) the dispersion is not constant and should be modeled policy dependent, (2) the features are not sufficient to explain the observations, or (3) the gamma distribution is not suitable and should be replaced by another distribution.

In Fig. 7.26 (lhs) we compare the predictions received from model Gamma GLM2 against the nagging predictors $\bar{\mu}^{(1:M)}(\boldsymbol{x}_i)$ over all instances $1 \le i \le n$. The scatter plot spreads quite wildly around the diagonal, which seriously questions at least one of the two models. To ensure that this variability between the two models is not caused by the (complex) FN network architecture, we verify the nagging

predictor $\bar{\mu}^{(1:M)}$, $M = 1\,000$, by computing a second independent one. Indeed, Fig. 7.26 (rhs) shows that these two independent nagging predictors come to the same conclusion on the individual instance level. Thus, the network finds/uses systematic effects that are not present in model Gamma GLM2. If we perform a pairwise interaction analysis for boosting the GLM as in Example 7.23, we find that we should add interactions to the GLM between (VehAge, RiskClass), (VehAge, BonusClass), (OwnerAge, Area), and (OwnerAge, VehAge); recall that model Gamma GLM2 includes neither BonusClass nor Gender, as supported by a drop1 backward elimination analysis from model Gamma GLM1. However, it turns out, here, that we should have BonusClass in the model by letting it interact with VehAge.

Finally, Fig. 7.27 shows the empirical auto-calibration behavior (7.39) of the Gamma FN network nagging predictor of Table 7.11. The resulting black dots are rather volatile, which shows that we do not (fully) have the auto-calibration property, here, but it also reflects that we fit a model on only 656 claims. The prediction of these claims is highlighted by the blue empirical density of $\bar{\mu}^{(1:M)}(\boldsymbol{x}_i)$, $1 \le i \le n$. On the positive side, the auto-calibration plot shows that we neither systematically under- nor over-estimate because the black dots fluctuate around the diagonal red line; only the upper tail seems to under-estimate the true claim size. $\blacksquare$

#### **Ensembling over Selected Networks vs. All Networks**

Zhou et al. [406] ask the question whether ensembling over 'selected' networks is better than ensembling over all networks. In their proposal they introduce a weighted averaging scheme over the different network predictors $\widehat{\mu}_m$, $1 \le m \le M$. We perform a slightly different analysis here. We re-use the $M = 1\,600$ SGD calibrations of the Poisson FN network illustrated in Fig. 7.17. We order these SGD calibrations w.r.t. their in-sample losses $\mathfrak{D}(\mathcal{L}, \widehat{\mu}_m)$, $1 \le m \le M$, and partition this ordered sample into three equally sized sets: the first one containing the smallest in-sample losses, the second one the middle-sized in-sample losses, and the third one the largest in-sample losses. Figure 7.28 shows the empirical density of these in-sample losses, and the vertical lines give the partition into the three sets; we call the resulting (disjoint) index sets $\mathcal{I}_{\text{small}}, \mathcal{I}_{\text{middle}}, \mathcal{I}_{\text{large}} \subset \{1,\dots,M\}$. Remark that this partition is done fully *in-sample*, based on the learning data $\mathcal{L}$, only.

**Fig. 7.28** Empirical density of the in-sample losses $\mathfrak{D}(\mathcal{L}, \widehat{\mu}_m)$, $1 \le m \le M$, of Fig. 7.17

We then consider the nagging predictors on each of these index sets separately, i.e.,

$$
\bar{\mu}^{\text{small}}(\boldsymbol{x}) = \frac{1}{|\mathcal{I}_{\text{small}}|} \sum_{m \in \mathcal{I}_{\text{small}}} \widehat{\mu}_m(\boldsymbol{x}),
$$

$$
\bar{\mu}^{\text{middle}}(\boldsymbol{x}) = \frac{1}{|\mathcal{I}_{\text{middle}}|} \sum_{m \in \mathcal{I}_{\text{middle}}} \widehat{\mu}_m(\boldsymbol{x}), \tag{7.46}
$$

$$
\bar{\mu}^{\text{large}}(\boldsymbol{x}) = \frac{1}{|\mathcal{I}_{\text{large}}|} \sum_{m \in \mathcal{I}_{\text{large}}} \widehat{\mu}_m(\boldsymbol{x}).
$$

If we believe in the orange cubic spline in Fig. 7.17, the middle nagging predictor $\bar{\mu}^{\text{middle}}$ should outperform the other two nagging predictors. Indeed, this is the case, here. We receive the following out-of-sample losses (in $10^{-2}$) on the three subsets

$$\mathfrak{D}(\mathcal{T}, \bar{\mu}^{\rm small}) = 23.784, \quad \mathfrak{D}(\mathcal{T}, \bar{\mu}^{\rm middle}) = 23.272, \quad \mathfrak{D}(\mathcal{T}, \bar{\mu}^{\rm large}) = 23.782. \tag{7.47}$$

This approach beats by far any other approach considered, see Table 7.10; note that this analysis relies on a fully proper in-sample and out-of-sample testing strategy. Moreover, this also supports our early stopping strategy because, obviously, the optimal networks are centered around our early stopping rule. How does this result match Proposition 7.25, which says that the nagging predictor has a monotonically decreasing deviance loss? For the convergence (7.45) we need unbiasedness, and (7.47) indicates that averaging over all $M$ network calibrations results in biases on an *individual* policy level; on the aggregate portfolio level, we have applied the bias regularization step (7.33), but this does not act on an individual policy level. The latter would require a local balance correction similar to the GAM approach presented in Example 7.19.
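The tercile construction (7.46) can be sketched in a few lines of plain Python: order the calibrations by their in-sample losses, split the ordered index set into three equally sized parts, and average the predictions within each part. Function names and the toy numbers are illustrative; the actual analysis uses $M = 1\,600$ SGD calibrations.

```python
# Ensembling over 'selected' networks: order the M calibrations by their
# in-sample losses, split into three equally sized index sets, and average
# the predictions within each set, as in (7.46).

def tercile_nagging(predictions, in_sample_losses):
    """predictions: list of M prediction vectors; in_sample_losses: M losses.
    Returns the three nagging predictors (small/middle/large in-sample loss)."""
    M = len(predictions)
    order = sorted(range(M), key=lambda m: in_sample_losses[m])
    k = M // 3
    subsets = {"small": order[:k], "middle": order[k:2 * k], "large": order[2 * k:]}

    def average(idx):
        n = len(predictions[0])
        return [sum(predictions[m][i] for m in idx) / len(idx) for i in range(n)]

    return {name: average(idx) for name, idx in subsets.items()}

# toy example with M = 6 calibrations on a 2-instance portfolio
preds  = [[0.10, 0.20], [0.12, 0.22], [0.08, 0.18],
          [0.11, 0.21], [0.09, 0.19], [0.13, 0.23]]
losses = [23.9, 23.1, 23.0, 23.5, 23.2, 24.0]
bars = tercile_nagging(preds, losses)
```

Note that the partition uses only the in-sample losses, so the out-of-sample comparison (7.47) remains a proper test.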

Figure 7.29 is truly striking! It compares the nagging predictors $\bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger)$ to the ones $\bar{\mu}^{\text{middle}}(\boldsymbol{x}_t^\dagger)$ only using the calibrations $m \in \mathcal{I}_{\text{middle}}$, i.e., only using the calibrations with middle-sized in-sample losses. The different colors show the exposures $v_t^\dagger \in (0,1]$. We observe that only portfolios with short exposures do not lie on the diagonal line. Thus, there seems to be an issue with insurance policies with short exposures. Recall that we model the Poisson claim counts $N_i$ using the assumption, see (5.27),

$$N_i \sim \text{Poi}(v_i \mu(\boldsymbol{x}_i)). \tag{7.48}$$

That is, the expected claim count $\mathbb{E}_{\theta_i}[N_i] = v_i \mu(\boldsymbol{x}_i)$ is assumed to scale proportionally in the exposure $v_i > 0$. Figure 7.29 raises some doubts whether this is really the case, or at least SGD fitting has some difficulties to assess the expected frequencies $\mu(\boldsymbol{x}_i)$ on the policies $i$ with short exposures $v_i > 0$. We discuss this further in the next subsection. Table 7.12 gives a summary of our results.
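For reference, the average Poisson deviance loss $\mathfrak{D}(\cdot, \mu)$ used throughout this comparison can be sketched as follows under assumption (7.48), with fitted means $v_i \mu(\boldsymbol{x}_i)$ and the usual convention $0 \log 0 = 0$; the function name and the toy numbers are illustrative.

```python
import math

# Average Poisson deviance loss for claim counts N_i with exposures v_i and
# expected frequencies mu(x_i): the fitted mean is lambda_i = v_i * mu(x_i),
# and each term is 2 * (lambda_i - N_i + N_i * log(N_i / lambda_i)),
# with the convention 0 * log(0) = 0.

def poisson_deviance(counts, exposures, freqs):
    dev = 0.0
    for N, v, mu in zip(counts, exposures, freqs):
        lam = v * mu
        dev += 2.0 * (lam - N + (N * math.log(N / lam) if N > 0 else 0.0))
    return dev / len(counts)   # average deviance per observation

counts, exposures, freqs = [0, 1, 2], [1.0, 0.5, 1.0], [0.1, 0.2, 0.15]
loss = poisson_deviance(counts, exposures, freqs)
```

Each term is non-negative and vanishes exactly when the fitted mean matches the observation, which is what makes the deviance a proper comparison criterion for the tables above.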

#### **Analysis of Over-dispersion**

With all the excitement of Fig. 7.29, the above models do not fit the observations since the over-dispersion is too large, see the last column of Table 7.12. This has motivated the study of the negative binomial model in Sect. 5.3.5, the ZIP model in Sect. 5.3.6, and the hurdle Poisson model in Example 6.19. These models have led to an improvement in terms of AIC, see Table 6.6. We could go down the same



route here by substituting the Poisson model. We refrain from doing so, as we want to further analyze the Poisson model. Suppose we calculate an AIC value for the Poisson FN network using 792 as the number of parameters involved. In that case, we receive a value of 191 790, which is clearly lower than the one of the negative binomial GLM, and also slightly lower than the one of the hurdle Poisson model, see Table 6.6. Remark that AIC values within FN networks are not supported by any theory, as we neither use the MLE nor do we have a reasonable evaluation of the number of parameters involved in networks. Thus, such a value may serve at best as a rough rule of thumb.
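The rule-of-thumb computation behind such an AIC comparison is simply $\text{AIC} = 2r - 2\ell$ for log-likelihood $\ell$ and parameter count $r$; the numbers below are hypothetical and only illustrate that a higher likelihood does not automatically win if it is bought with too many extra parameters.

```python
# AIC = 2*r - 2*loglik, with r the number of parameters. For FN networks
# this is only a rough rule of thumb: early-stopped SGD solutions are not
# MLEs, and the effective number of parameters is not well defined.

def aic(log_likelihood, num_params):
    return 2.0 * num_params - 2.0 * log_likelihood

# hypothetical log-likelihoods and parameter counts
a1 = aic(-95000.0, 792)    # smaller model
a2 = aic(-94950.0, 1500)   # better fit, but many more parameters
```

Here the second model has the higher likelihood, yet the first one wins on AIC because the parameter penalty dominates.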

This lower AIC value suggests that we should try to improve the modeling of the systematic effects by better regression functions. In particular, there may be more explanatory variables involved that have predictive power. If these explanatory variables are latent, we can rely on the negative binomial model, as it can be interpreted as a mixture model averaging over latent variables. In view of Fig. 7.29, the exposures $v_i$ seem to have a predictive power different from proportional scaling, see (7.48); we also mention some peculiarities of the exposures on page 556. This motivates changing the FN network regression model such that the exposures are considered non-proportionally. We choose a FN network that directly models the mean of the claim counts

$$
(\boldsymbol{x}, v) \in \mathcal{X} \times (0, 1] \mapsto \mu(\boldsymbol{x}, v) = \exp\left\langle \boldsymbol{\beta}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x}, v)\right\rangle > 0, \tag{7.49}
$$

modeling the mean $\mathbb{E}_{\vartheta}[N] = \mu(\boldsymbol{x}, v)$ of the Poisson datum $(N, \boldsymbol{x}, v)$. The expected frequency is then given by $\mathbb{E}_{\vartheta}[Y] = \mathbb{E}_{\vartheta}[N/v] = \mu(\boldsymbol{x}, v)/v$.

*Remark 7.29* At this stage we clearly have to distinguish between statistical modeling and actuarial modeling. In statistical modeling it makes perfect sense to choose the regression function (7.49), since including the exposure in a non-proportional way may increase the predictive power of the model; at least, this is what our data suggests.

From an actuarial point of view this approach should clearly be doubted. The typical exposure of car insurance policies is one calendar year, i.e., $v = 1$, if the renewals of insurance policies are accounted for correctly. Shorter exposures may have a specific (non-predictable) reason; for example, the policyholder or the insurance company may terminate an insurance contract after a claim. Thus, if this is possible, the exposure is a random variable, too, and it clearly has predictive power for claims prediction; in that case we lose the properties of the Poisson count process (having independent and stationary increments).

As a consequence, we should include the exposure proportionally from an actuarial modeling point of view. Nevertheless, we do the modeling exercise based on the regression function (7.49), here. This will indicate the predictive power of the exposure, which may be thought of as a proxy for another (non-available) explanatory variable. Moreover, if (7.49) allows for a good Poisson regression model, we have a simple way of bootstrapping from our data (conditionally on given exposures $v$).

We would also like to emphasize that if one feature component dominates all others in terms of the predictive power, then likely there is a leakage of information through this component, and this needs a more careful analysis.

We implement the FN network regression model (7.49) using again a network architecture of depth $d = 3$ with $(q_1, q_2, q_3) = (20, 15, 10)$ neurons. We use embedding layers for the two categorical variables VehBrand and Region, and we have 8 continuous/binary feature components. This is one more compared to Fig. 7.9 (rhs) because we also model the exposure $v_i$ as a continuous input to the network. As a result, the dimension $r$ of the network parameter $\vartheta \in \mathbb{R}^r$ increases from 792 to 812 (because we have $q_1 = 20$ neurons in the first FN layer). We calculate the nagging predictor $\bar{\mu}^{(1:M)}$ of this network averaging over $M = 500$ individual (early stopped) FN network calibrations; the results are presented in Table 7.13.

**Table 7.13** Number of parameters, in-sample and out-of-sample deviance losses (units are in $10^{-2}$), in-sample average frequency and (over-)dispersion of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network models (with embedding layers of dimension $b = 2$), the nagging predictors, and the middle nagging predictors excluding and including exposures $v_i$ as continuous network inputs


We observe a major improvement when including the exposure $v$ as an input to the network, i.e., by including the exposure non-proportionally in the mean estimate. This is true in-sample (we use early stopping here), and in terms of Pearson's dispersion estimate; we set $r = 812$ for the number of parameters in Pearson's dispersion estimate (5.30), which may be too big because we do not perform a proper MLE, here. In particular, we receive a dispersion estimate close to one which, now, supports modeling the claim counts by Poisson random variables (using this regression function). That is, this regression function explains the systematic effects so that we no longer observe much over-dispersion in the data relative to the chosen model. However, we remind the reader of Remark 7.29, which requires careful consideration before using this regression model in insurance practice.

This is also supported by Fig. 7.30, which studies the average frequency as a function of the exposure $v \in (0,1]$. The red observed average frequency has a clear decreasing slope which can be modeled by running the exposure $v$ through the FN network (black), but not by including it proportionally (blue). From an actuarial modeling point of view this plot clearly questions the quality of the data, because there seem to be effects in the exposures that certainly require more investigation. Unfortunately, we cannot do this here because we do not have additional insight into this data set. This closes the example.

## *7.4.5 Identifiability in Feed-Forward Neural Networks*

In the previous section we have studied ensembles of FN networks. One may also aim at directly comparing these networks to each other in terms of the fitted network parameters $\widehat{\vartheta}^j$ over the different calibrations $1 \le j \le M$ (of the same FN network architecture). Such a comparison may, e.g., be useful if one wants to choose a prior parameter distribution $\pi$ for $\vartheta$ in a Bayesian setting. Comparing the different network calibrations $\widehat{\vartheta}^j$, $1 \le j \le M$, of an architecture needs some care because networks have many symmetries that make the parameters non-identifiable. We can, for instance, permute the neurons in a FN layer $\boldsymbol{z}^{(m)}$, with the corresponding permutation of the weights that connect this layer to the previous layer $\boldsymbol{z}^{(m-1)}$ and to the succeeding layer $\boldsymbol{z}^{(m+1)}$. The resulting predictive model under this permutation is the same as the original one. For this reason we need to introduce some order in a FN network to make the parameters identifiable.

Rüger–Ossen [323] have introduced the notion of a fundamental domain for the network parameter $\vartheta$, and we briefly review this idea. We start with an explicit example. Assume that the activation function fulfills the anti-symmetry property $-\phi(x) = \phi(-x)$ for all $x \in \mathbb{R}$; this is the case for the hyperbolic tangent. This implies several symmetries in the FN network parametrization. E.g., if we consider the output of a shallow FN network ($d = 1$) with link function $g$, we can do a sign switch in a fixed neuron $1 \le k \le q_1$

$$g(\mu(\boldsymbol{x})) = \beta_0 + \sum_{j=1}^{q_1} \beta_j z_j^{(1:1)}(\boldsymbol{x}) = \beta_0 + \sum_{j=1}^{q_1} \beta_j\, \phi\left\langle \boldsymbol{w}_j^{(1)}, \boldsymbol{x}\right\rangle$$

$$= \beta_0 + \sum_{j \neq k} \beta_j\, \phi\left\langle \boldsymbol{w}_j^{(1)}, \boldsymbol{x}\right\rangle + (-\beta_k)\, \phi\left\langle -\boldsymbol{w}_k^{(1)}, \boldsymbol{x}\right\rangle. \tag{7.50}$$

From this we see that the following two network parameters (we switch signs in all the parameters that belong to index *k*)

$$\begin{aligned} \vartheta &= (\boldsymbol{w}_1^{(1)}, \dots, \boldsymbol{w}_k^{(1)}, \dots, \boldsymbol{w}_{q_1}^{(1)}, \beta_0, \dots, \beta_k, \dots, \beta_{q_1})^\top \quad\text{and} \\ \widetilde{\vartheta} &= (\boldsymbol{w}_1^{(1)}, \dots, -\boldsymbol{w}_k^{(1)}, \dots, \boldsymbol{w}_{q_1}^{(1)}, \beta_0, \dots, -\beta_k, \dots, \beta_{q_1})^\top \end{aligned}$$

give the same FN network predictions. Besides these sign switches, we can also permute the enumeration of the neurons in a given FN layer, giving the same predictions. We discuss Theorem 2 of Rüger–Ossen [323] to solve this identifiability issue. First, we consider the network weights from the input $\boldsymbol{x}$ to the first FN layer $\boldsymbol{z}^{(1)}(\boldsymbol{x})$. Apply the sign switch operation (7.50) to the neurons in the first FN layer so that all the resulting intercepts $w_{0,1}^{(1)}, \dots, w_{0,q_1}^{(1)}$ are positive while not changing the regression function $\boldsymbol{x} \mapsto g(\mu(\boldsymbol{x}))$. Next, apply a permutation to the indices $1 \le j \le q_1$ so that we receive ordered intercepts

$$w\_{0,1}^{(1)} > \dots > w\_{0,q\_1}^{(1)} > 0,$$

with an unchanged regression function $\boldsymbol{x} \mapsto g(\mu(\boldsymbol{x}))$. To make these transformations well-defined we need to assume that all intercepts are non-zero and mutually different (which we assume for the time being).
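The two normalization steps for the first FN layer can be sketched in plain Python: flip the sign of every neuron with a negative intercept (which also flips the corresponding output weight $\beta_k$, by (7.50)), then permute the neurons so that the intercepts are strictly decreasing. The data structures and names below are illustrative, assuming a tanh-type activation.

```python
# Normalization into the fundamental domain (first FN layer of a shallow
# tanh network): sign switches make all intercepts positive, a permutation
# then orders them strictly decreasing; the regression function is unchanged.

def to_fundamental_domain(weights, betas):
    """weights: per-neuron weight vectors [w_0 (intercept), w_1, ...];
    betas: output weights beta_1..beta_q1. Returns normalized (weights, betas)."""
    normed = []
    for w, b in zip(weights, betas):
        if w[0] < 0:  # sign switch uses -phi(x) = phi(-x) for tanh
            w, b = [-wi for wi in w], -b
        normed.append((w, b))
    normed.sort(key=lambda wb: wb[0][0], reverse=True)  # order the intercepts
    return [w for w, _ in normed], [b for _, b in normed]

w, beta = to_fundamental_domain(
    [[-0.5, 1.0], [0.2, -0.3], [0.9, 0.4]], [2.0, -1.0, 0.5])
```

After normalization the intercepts satisfy $w_{0,1}^{(1)} > w_{0,2}^{(1)} > w_{0,3}^{(1)} > 0$, as required by (7.51).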

Then, we move recursively through the FN layers 2 ≤ *m* ≤ *d* applying the sign switch operations and the permutations so that the regression function *x* → *g(μ(x))* remains unchanged and such that for all 1 ≤ *m* ≤ *d*

$$w\_{0,1}^{(m)} > \dots > w\_{0,q\_m}^{(m)} > 0.$$

This provides us with a unique representation of every network parameter $\vartheta \in \mathbb{R}^r$ in the *fundamental domain*

$$\left\{\vartheta \in \mathbb{R}^r;\ w_{0,1}^{(m)} > \dots > w_{0,q_m}^{(m)} > 0 \text{ for all } 1 \le m \le d\right\} \subset \mathbb{R}^r, \tag{7.51}$$

provided that all intercepts are different from zero and mutually different within the same FN layers. As stated in Section 2.2 of Rüger–Ossen [323], there may still exist different parameters in this fundamental domain that provide the same predictive model, but these are of zero Lebesgue measure. The same applies to intercepts $w_{0,j}^{(m)}$ being zero or to equal intercepts for different neurons. Basically, this means that we are fine if we work with absolutely continuous prior distributions on the fundamental domain when we want to work within a Bayesian setup.

## **7.5 Auto-encoders**

Auto-encoders are tools that aim at reducing the dimension of high-dimensional data such that the reconstruction error of the original data is small, i.e., such that the loss of information by the dimension reduction is minimized. The most popular auto-encoder is the principal components analysis (PCA) which we are going to present here. The PCA is a linear dimension reduction technique. Bottleneck neural (BN) networks can be viewed as a non-linear extension of the PCA. This is going to be discussed in Sect. 7.5.5, below. Dimension reduction techniques belong to the family of unsupervised learning methods because they do not consider a response variable, but they aim at finding common structure in the features. Unsupervised learning methods can roughly be categorized into three classes: dimension reduction techniques (studied in this section), clustering methods and visualization methods. For a discussion of clustering and visualization methods we refer to the tutorial of Rentzmann–Wüthrich [310].

## *7.5.1 Standardization of the Data Matrix*

Assume we have $q$-dimensional data points $\boldsymbol{y}_i \in \mathbb{R}^q$, $1 \le i \le n$. This provides us with a data matrix

$$\boldsymbol{Y} = (\boldsymbol{y}_1, \dots, \boldsymbol{y}_n)^\top = \begin{pmatrix} y_{1,1} & \cdots & y_{1,q} \\ \vdots & \ddots & \vdots \\ y_{n,1} & \cdots & y_{n,q} \end{pmatrix} \in \mathbb{R}^{n \times q}.$$

We assume that each of the $q$ columns of $\boldsymbol{Y}$ measures a quantity in a given unit. The first column may, for instance, describe the age of a car driver in years, the second column their body weight in kilograms, etc. That is, each column $1 \le j \le q$ of $\boldsymbol{Y}$ describes a specific quantity, and each row $\boldsymbol{y}_i^\top$ of $\boldsymbol{Y}$ describes these quantities for a given instance $1 \le i \le n$. Since often the analysis should not depend on the units of the columns of $\boldsymbol{Y}$, one centers the columns with the empirical means $\bar{y}_j = \sum_{i=1}^n y_{i,j}/n$, and one normalizes them with the empirical standard deviations $\widehat{\sigma}_j = \big(\sum_{i=1}^n (y_{i,j} - \bar{y}_j)^2/n\big)^{1/2}$, $1 \le j \le q$. This gives the normalized data matrix

$$\begin{pmatrix} \frac{y_{1,1} - \bar{y}_1}{\widehat{\sigma}_1} & \cdots & \frac{y_{1,q} - \bar{y}_q}{\widehat{\sigma}_q} \\ \vdots & \ddots & \vdots \\ \frac{y_{n,1} - \bar{y}_1}{\widehat{\sigma}_1} & \cdots & \frac{y_{n,q} - \bar{y}_q}{\widehat{\sigma}_q} \end{pmatrix} \in \mathbb{R}^{n \times q}. \tag{7.52}$$

We typically center the data matrix $\boldsymbol{Y}$, providing $\sum_{i=1}^n y_{i,j} = 0$ for all $1 \le j \le q$; normalization w.r.t. the standard deviation can be done, but is not always necessary. Centering implies that we can interpret $\boldsymbol{Y}$ as a $q$-dimensional empirical distribution with each component (column) being centered. The covariance matrix of this (centered) empirical distribution is calculated as

$$\widehat{\Sigma} = \frac{1}{n} \left( \sum_{i=1}^{n} y_{i,j}\, y_{i,k} \right)_{1 \le j,k \le q} = \frac{1}{n}\, \boldsymbol{Y}^{\top} \boldsymbol{Y} \in \mathbb{R}^{q \times q}. \tag{7.53}$$

This is a covariance matrix, and if the columns of $\boldsymbol{Y}$ are normalized with the empirical standard deviations $\widehat{\sigma}_j$, $1 \le j \le q$, it is a correlation matrix.
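The column-wise standardization (7.52) can be sketched in plain Python, using the empirical means and the biased ($1/n$) standard deviations as above; the function name and toy data are illustrative.

```python
# Column-wise centering and normalization of the data matrix Y as in (7.52),
# with empirical means and biased (1/n) standard deviations.

def standardize_columns(Y):
    n, q = len(Y), len(Y[0])
    means = [sum(Y[i][j] for i in range(n)) / n for j in range(q)]
    sds = [(sum((Y[i][j] - means[j]) ** 2 for i in range(n)) / n) ** 0.5
           for j in range(q)]
    return [[(Y[i][j] - means[j]) / sds[j] for j in range(q)] for i in range(n)]

# toy data: driver age (years) and weight (kg) live on different scales
Y = [[20.0, 60.0], [40.0, 80.0], [60.0, 100.0]]
Z = standardize_columns(Y)
```

After standardization every column has empirical mean zero and variance one, so (7.53) computed on `Z` is a correlation matrix.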

## *7.5.2 Introduction to Auto-encoders*

An auto-encoder encodes a high-dimensional vector $\boldsymbol{y} \in \mathbb{R}^q$ to a low-dimensional representation so that the dimension reduction leads to a minimal loss of information. A function $L(\cdot,\cdot): \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}_+$ is called *dissimilarity function* if $L(\boldsymbol{y}, \boldsymbol{y}') = 0$ if and only if $\boldsymbol{y} = \boldsymbol{y}'$.

An auto-encoder is a pair $(\Phi, \Psi)$ of mappings, for given dimensions $p < q$,

$$
\Phi: \mathbb{R}^q \to \mathbb{R}^p \qquad \text{and} \qquad \Psi: \mathbb{R}^p \to \mathbb{R}^q,\tag{7.54}
$$

such that their composition $\Psi \circ \Phi$ has a small reconstruction error w.r.t. the chosen dissimilarity function $L(\cdot,\cdot)$, that is,

$$\mathbf{y} \mapsto L\left(\mathbf{y}, \boldsymbol{\Psi} \circ \boldsymbol{\Phi}(\mathbf{y})\right) \text{ is small for all cases } \mathbf{y} \text{ of interest.}\tag{7.55}$$

Note that we want (7.55) for selected cases $\boldsymbol{y}$, and if they are within a $p$-dimensional manifold the auto-encoding will be successful. The first mapping $\Phi: \mathbb{R}^q \to \mathbb{R}^p$ is called encoder, and the second mapping $\Psi: \mathbb{R}^p \to \mathbb{R}^q$ is called decoder. The object $\Phi(\boldsymbol{y}) \in \mathbb{R}^p$ is a $p$-dimensional encoding (representation) of $\boldsymbol{y} \in \mathbb{R}^q$ which contains maximal information of $\boldsymbol{y}$ up to the reconstruction error (7.55).
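A minimal linear example of such a pair $(\Phi, \Psi)$ with $q = 2$ and $p = 1$ can be written down directly: $\Phi$ projects onto a fixed unit vector $v$ (an illustrative choice, not from the text), $\Psi$ maps the code back, and the reconstruction error (7.55) vanishes exactly on the one-dimensional manifold spanned by $v$.

```python
# A linear auto-encoder (Phi, Psi) for p = 1 < q = 2: Phi projects onto a
# fixed unit vector v, Psi maps the code back to R^2. The reconstruction
# error is zero exactly for cases y on the line spanned by v.

v = (0.6, 0.8)                       # unit vector: 0.36 + 0.64 = 1

def encoder(y):                      # Phi: R^2 -> R^1
    return y[0] * v[0] + y[1] * v[1]

def decoder(z):                      # Psi: R^1 -> R^2
    return (z * v[0], z * v[1])

def reconstruction_error(y):         # squared Euclidean dissimilarity L
    r = decoder(encoder(y))
    return (y[0] - r[0]) ** 2 + (y[1] - r[1]) ** 2

on_line  = (3.0, 4.0)                # equals 5 * v, lies on the manifold
off_line = (4.0, -3.0)               # orthogonal to v
```

The point on the line is reconstructed (essentially) exactly, while the orthogonal point loses all of its information, illustrating why (7.55) is only required for the cases of interest.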

## *7.5.3 Principal Components Analysis*

PCA gives us a linear auto-encoder (7.54). If the data matrix $\boldsymbol{Y} \in \mathbb{R}^{n \times q}$ has rank $q$, there exist $q$ linearly independent rows of $\boldsymbol{Y}$ that span $\mathbb{R}^q$. PCA determines a different, very specific basis of $\mathbb{R}^q$. It looks for an orthonormal basis $\boldsymbol{v}_1, \dots, \boldsymbol{v}_q \in \mathbb{R}^q$ such that $\boldsymbol{v}_1$ explains the direction of the biggest variability in $\boldsymbol{Y}$, $\boldsymbol{v}_2$ the direction of the second biggest variability in $\boldsymbol{Y}$ orthogonal to $\boldsymbol{v}_1$, and so forth. Variability is understood in the sense of maximal empirical variance under the assumption that the columns of $\boldsymbol{Y}$ are centered, see (7.52)–(7.53). Such an orthonormal basis can be found by determining $q$ linearly independent eigenvectors of the symmetric and positive definite matrix

$$A = n\widehat{\Sigma} = \mathbf{Y}^{\top}\mathbf{Y} \in \mathbb{R}^{q \times q}.$$

For this we can solve recursively the following convex Lagrange problems. The first basis vector $\boldsymbol{v}_1 \in \mathbb{R}^q$ is determined by the solution of<sup>3</sup>

$$\boldsymbol{v}_1 = \underset{\|\boldsymbol{w}\|_2 = 1}{\arg\max}\ \|\boldsymbol{Y}\boldsymbol{w}\|_2^2 = \underset{\|\boldsymbol{w}\|_2 = 1}{\arg\max}\ \left(\boldsymbol{w}^\top \boldsymbol{Y}^\top \boldsymbol{Y} \boldsymbol{w}\right), \tag{7.56}$$

and the $j$-th basis vector $\boldsymbol{v}_j \in \mathbb{R}^q$, $2 \le j \le q$, is received recursively by the solution of

$$\boldsymbol{v}_j = \underset{\|\boldsymbol{w}\|_2 = 1}{\arg\max}\ \|\boldsymbol{Y}\boldsymbol{w}\|_2^2 \qquad \text{subject to } \langle \boldsymbol{v}_k, \boldsymbol{w}\rangle = 0 \text{ for all } 1 \le k \le j - 1. \tag{7.57}$$

<sup>3</sup> If the $q$ eigenvalues of $A$ are distinct, the solution to (7.56) and (7.57) is unique up to the sign; otherwise this requires more care.


Singular value decomposition (SVD) gives an alternative way of computing this orthonormal basis; we refer to Section 14.5.1 in Hastie et al. [183]. The algorithm of Golub–Van Loan [165] gives an efficient way of performing a SVD. There exist orthogonal matrices $\boldsymbol{U} \in \mathbb{R}^{n \times q}$ and $\boldsymbol{V} \in \mathbb{R}^{q \times q}$ (with $\boldsymbol{U}^\top \boldsymbol{U} = \boldsymbol{V}^\top \boldsymbol{V} = \mathbb{1}_q$), and a diagonal matrix $\Lambda = \text{diag}(\lambda_1, \dots, \lambda_q) \in \mathbb{R}^{q \times q}$ with singular values $\lambda_1 \ge \dots \ge \lambda_q > 0$ such that we have the SVD

$$\boldsymbol{Y} = \boldsymbol{U} \boldsymbol{\Lambda} \boldsymbol{V}^{\top}. \tag{7.58}$$

The matrix $\boldsymbol{U}$ is called the left-singular matrix of $\boldsymbol{Y}$, and the matrix $\boldsymbol{V}$ is called the right-singular matrix of $\boldsymbol{Y}$. Using the SVD (7.58), we observe

$$V^\top AV = V^\top Y^\top Y V = V^\top V \Lambda U^\top U \Lambda V^\top V = \Lambda^2 = \text{diag}(\lambda\_1^2, \dots, \lambda\_q^2).$$

That is, the squared singular values $(\lambda_j^2)_{1 \le j \le q}$ are the eigenvalues of the matrix $A$, and the column vectors of the right-singular matrix $\boldsymbol{V} = (\boldsymbol{v}_1, \dots, \boldsymbol{v}_q)$ (eigenvectors of $A$) give an orthonormal basis $\boldsymbol{v}_1, \dots, \boldsymbol{v}_q$. This motivates defining the $q$ principal components of $\boldsymbol{Y}$ as the column vectors of

$$\boldsymbol{Y}\boldsymbol{V} = \boldsymbol{U}\Lambda = \boldsymbol{U}\,\text{diag}(\lambda_1, \dots, \lambda_q) = (\lambda_1 \boldsymbol{u}_1, \dots, \lambda_q \boldsymbol{u}_q) \in \mathbb{R}^{n \times q}. \tag{7.59}$$

E.g., the first principal component of the instances $1 \le i \le n$ is given by $\boldsymbol{Y}\boldsymbol{v}_1 = \lambda_1 \boldsymbol{u}_1 \in \mathbb{R}^n$. Considering the first $p \le q$ principal components gives the rank $p$ matrix

$$\boldsymbol{Y}\_p = \boldsymbol{U}\text{diag}(\lambda\_1, \dots, \lambda\_p, 0, \dots, 0)\boldsymbol{V}^{\top} \in \mathbb{R}^{n \times q}.\tag{7.60}$$

The Eckart–Young–Mirsky theorem [114, 279]<sup>4</sup> proves that this rank $p$ matrix $\boldsymbol{Y}_p$ minimizes the Frobenius norm relative to $\boldsymbol{Y}$ among all rank $p$ matrices, that is,

$$\boldsymbol{Y}_p \in \underset{\boldsymbol{B} \in \mathbb{R}^{n \times q}}{\arg\min}\ \|\boldsymbol{Y} - \boldsymbol{B}\|_{\text{F}} \qquad \text{subject to } \text{rank}(\boldsymbol{B}) \le p, \tag{7.61}$$

where the Frobenius norm is given by $\|C\|_{\text{F}}^2 = \sum_{i,j} c_{i,j}^2$ for a matrix $C = (c_{i,j})_{i,j}$. The orthonormal basis $\boldsymbol{v}_1, \dots, \boldsymbol{v}_q \in \mathbb{R}^q$ gives the (linear) encoder (projection)

$$\Phi: \mathbb{R}^q \to \mathbb{R}^p, \qquad \mathbf{y} \mapsto \Phi(\mathbf{y}) = \left(\mathbf{y}^\top \mathbf{v}\_1, \dots, \mathbf{y}^\top \mathbf{v}\_p\right)^\top = (\mathbf{v}\_1, \dots, \mathbf{v}\_p)^\top \mathbf{y}.$$

<sup>4</sup> In fact, (7.61) holds for both the Frobenius norm and the spectral norm.

This gives the first $p$ principal components in (7.59) if we insert the transposed data matrix $\boldsymbol{Y}^\top = (\boldsymbol{y}_1, \dots, \boldsymbol{y}_n) \in \mathbb{R}^{q \times n}$ for $\boldsymbol{y} \in \mathbb{R}^q$. The (linear) decoder is given by

$$\Psi: \mathbb{R}^p \to \mathbb{R}^q, \qquad \boldsymbol{z} \mapsto \Psi(\boldsymbol{z}) = (\boldsymbol{v}_1, \dots, \boldsymbol{v}_p)\,\boldsymbol{z}.$$

The following is understood column-wise for the transposed data matrix $\boldsymbol{Y}^\top$:

$$\begin{aligned} \Psi \circ \Phi(\boldsymbol{Y}^{\top}) &= \Psi\left((\boldsymbol{v}_1, \dots, \boldsymbol{v}_p)^{\top} \boldsymbol{Y}^{\top}\right) \\ &= \left(\boldsymbol{Y}(\boldsymbol{v}_1, \dots, \boldsymbol{v}_p)(\boldsymbol{v}_1, \dots, \boldsymbol{v}_p)^{\top}\right)^{\top} \\ &= \left(\boldsymbol{Y}(\boldsymbol{v}_1, \dots, \boldsymbol{v}_p, 0, \dots, 0)(\boldsymbol{v}_1, \dots, \boldsymbol{v}_p, \boldsymbol{v}_{p+1}, \dots, \boldsymbol{v}_q)^{\top}\right)^{\top} \\ &= \left(\boldsymbol{U}\,\text{diag}(\lambda_1, \dots, \lambda_p, 0, \dots, 0)\,\boldsymbol{V}^{\top}\right)^{\top} = \boldsymbol{Y}_p^{\top}. \end{aligned}$$

Thus, $\Psi \circ \Phi(\boldsymbol{Y}^\top)$ minimizes the Frobenius reconstruction error (7.61) on the data matrix $\boldsymbol{Y}$ among all linear maps of rank $p$. In view of (7.55), we can express the squared Frobenius reconstruction error as

$$\|\boldsymbol{Y} - \boldsymbol{Y}_p\|_{\text{F}}^2 = \sum_{i=1}^n \left\|\boldsymbol{y}_i - \Psi \circ \Phi(\boldsymbol{y}_i)\right\|_2^2 = \sum_{i=1}^n L\left(\boldsymbol{y}_i, \Psi \circ \Phi(\boldsymbol{y}_i)\right), \tag{7.62}$$

thus, we choose the squared Euclidean distance as the dissimilarity measure, here, which we minimize simultaneously over all cases $\boldsymbol{y}_i$, $1 \le i \le n$.
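For a two-column centered data matrix, the first basis vector $\boldsymbol{v}_1$ of (7.56) can be written in closed form as the leading eigenvector of $A = \boldsymbol{Y}^\top\boldsymbol{Y}$, and the rank-1 reconstruction error (7.62) can be checked directly. The sketch below uses this $2 \times 2$ special case with illustrative toy data.

```python
import math

# First principal direction v_1 for a centered two-column data matrix Y:
# the leading eigenvector of the 2x2 matrix A = Y^T Y in closed form.

def first_principal_direction(Y):
    a = sum(y[0] * y[0] for y in Y)
    b = sum(y[0] * y[1] for y in Y)
    c = sum(y[1] * y[1] for y in Y)
    lam1 = 0.5 * ((a + c) + math.sqrt((a - c) ** 2 + 4.0 * b ** 2))
    v = (b, lam1 - a) if b != 0 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(*v)
    return (v[0] / norm, v[1] / norm)

# centered toy data spreading along the diagonal
Y = [[-2.0, -1.9], [0.0, 0.1], [2.0, 1.8]]
v1 = first_principal_direction(Y)

# rank-1 reconstruction Psi(Phi(y)) = <y, v1> v1 and its squared error (7.62)
err = sum((y[0] - (y[0] * v1[0] + y[1] * v1[1]) * v1[0]) ** 2
          + (y[1] - (y[0] * v1[0] + y[1] * v1[1]) * v1[1]) ** 2 for y in Y)
```

The data is almost perfectly one-dimensional along the diagonal, so the squared reconstruction error equals the small second eigenvalue of $A$; this is the Eckart–Young–Mirsky optimality (7.61) in the simplest case.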

*Remark 7.30* The PCA gives a linear approximation to the data matrix $\boldsymbol{Y}$ by minimizing (7.61) and (7.62) for given rank $p$. This may not be appropriate if the non-linear terms are dominant. Figure 7.31 (lhs) gives a situation where the PCA works well; this data has been generated by i.i.d. multivariate Gaussian random vectors $\boldsymbol{y}_i \sim \mathcal{N}(\boldsymbol{0}, \Sigma)$. Figure 7.31 (middle) gives a non-linear example where the PCA does not work well; the data matrix $\boldsymbol{Y} \in \mathbb{R}^{n \times 2}$ is a column-centered matrix that builds a circle around the origin.

Another nice example where the PCA fails is Fig. 7.31 (rhs). This figure is inspired by Shlens [337] and Ruckstuhl [321]. It shows a situation where the level sets are non-convex, and the principal components point into a completely wrong direction to explain the structure of the data.

**Fig. 7.31** Two-dimensional PCAs in different situations of the data matrix $\boldsymbol{Y} \in \mathbb{R}^{n \times 2}$

## *7.5.4 Lab: Lee–Carter Mortality Model*

We use the SVD to fit the most popular stochastic mortality model, the Lee–Carter (LC) model [238], to (raw) mortality data. The raw mortality data considers for each calendar year $t$ and each age $x$ the number of people $D_{x,t}$ who died (in that year $t$ at age $x$) divided by the corresponding population exposure $e_{x,t}$. In practice this requires some care. Due to migration, often, the exposures $e_{x,t}$ are non-observable figures and need to be estimated. Moreover, also the death counts $D_{x,t}$ in year $t$ at age $x$ can be defined differently; age cohorts are usually defined by the year of birth. We denote the (observed) raw mortality rates by $M_{x,t} = D_{x,t}/e_{x,t}$. The subsequent derivations consider the raw log-mortality rates $\log(M_{x,t})$; for this reason we assume that $M_{x,t} > 0$ for all calendar years $t$ and ages $x$. The goal is to model these raw log-mortality rates (for each country, region, risk group and gender separately).

The LC model defines the force of mortality as

$$
\log(\mu_{x,t}) = a_x + b_x k_t,\tag{7.63}
$$

where $\log(\mu_{x,t})$ is the (deterministic) log-mortality rate in calendar year $t$ for a person aged $x$ (for a fixed country, region and gender). The individual terms in (7.63) have the following meaning: $a_x$ is the average force of mortality at age $x$, $b_x$ is the rate of change of the force of mortality broken down to the different ages $x$, and $k_t$ is the time index describing the change of the force of mortality in calendar year $t$.

Strictly speaking, we do not have a stochastic model here that can explain the observations $M_{x,t}$; rather, we try to fit a deterministic mortality surface $(\mu_{x,t})_{x,t}$ to the noisy observations $(M_{x,t})_{x,t}$. For this we use the PCA and the Frobenius norm as the measure of dissimilarity (on the log-scale).

In a first step, we center the raw log-mortality rates for all ages $x$, i.e., over the calendar years $t \in \mathcal{T}$ under consideration. We define the centered raw log-mortality rates $Y_{x,t}$ and the estimate $\widehat{a}_x$ of the average force of mortality at age $x$ as follows

$$Y_{x,t} = \log(M_{x,t}) - \widehat{a}_x = \log(M_{x,t}) - \frac{1}{|\mathcal{T}|} \sum_{s \in \mathcal{T}} \log(M_{x,s}),\tag{7.64}$$

where the last identity defines the estimate $\widehat{a}_x$. Strictly speaking, there is a slight difference to the centering in Sect. 7.5.1 because here we center the rows and not the columns of the data matrix, but the roles of rows and columns are exchangeable in the PCA. The optimal (parameter) values $(\widehat{b}_x)_x$ and $(\widehat{k}_t)_t$ are determined as follows, see (7.63),

$$\operatorname*{arg\,min}_{(b_x)_x,\,(k_t)_t} \sum_{x,t} \left(Y_{x,t} - b_x k_t\right)^2,$$

where the sum runs over the years $t \in \mathcal{T}$ and the ages $x_0 \le x \le x_1$, with $x_0$ and $x_1$ being the lower and upper age boundaries. This can be rewritten as an optimization problem of the form (7.61)–(7.62). Consider the data matrix $Y = (Y_{x,t})_{x_0 \le x \le x_1;\, t \in \mathcal{T}} \in \mathbb{R}^{n\times q}$, and set $n = x_1 - x_0 + 1$ and $q = |\mathcal{T}|$. Assume $Y$ has rank $q$. This allows us to consider

$$Y_1 \in \operatorname*{arg\,min}_{B \in \mathbb{R}^{n \times q}} \|Y - B\|_{\mathrm{F}} \qquad \text{subject to } \operatorname{rank}(B) \le 1.$$

A solution to this problem is given by, see (7.60),

$$Y_1 = U \operatorname{diag}(\lambda_1, 0, \dots, 0)\, V^{\top} = (\lambda_1 \mathbf{u}_1)\, \mathbf{v}_1^{\top} = (Y \mathbf{v}_1)\, \mathbf{v}_1^{\top} \;\in\; \mathbb{R}^{n \times q},$$

with left-singular matrix $U = (\mathbf{u}_1,\dots,\mathbf{u}_q) \in \mathbb{R}^{n\times q}$ and right-singular matrix $V = (\mathbf{v}_1,\dots,\mathbf{v}_q) \in \mathbb{R}^{q\times q}$ of $Y$. This implies that the first principal component $\lambda_1 \mathbf{u}_1 = Y\mathbf{v}_1 \in \mathbb{R}^n$ gives an estimate for $(b_x)_{x_0 \le x \le x_1}$, and the first column vector $\mathbf{v}_1 \in \mathbb{R}^q$ of $V$ gives an estimate for the time index $(k_t)_{t\in\mathcal{T}}$. For parameter identifiability we normalize

$$\sum_{x=x_0}^{x_1} \widehat{b}_x = 1 \qquad\text{and}\qquad \sum_{t \in \mathcal{T}} \widehat{k}_t = 0,\tag{7.65}$$

the latter being consistent with the centering of the rows of $Y$ by $\widehat{a}_x$ in (7.64).
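The whole LC calibration above amounts to one SVD plus the normalization (7.65). The following Python/numpy sketch fits the model to a synthetic log-mortality matrix; the data, shapes and variable names are our own (the book's own implementations use R):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 100, 67                     # ages x_0..x_1 and calendar years in T
# synthetic raw log-mortality surface (for illustration only)
log_M = (-8.0 + 0.07 * np.arange(n)[:, None]
         - 0.01 * np.arange(q)[None, :]
         + 0.02 * rng.standard_normal((n, q)))

a_hat = log_M.mean(axis=1)         # \hat a_x: row means, see (7.64)
Y = log_M - a_hat[:, None]         # centered raw log-mortality rates Y_{x,t}

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
b_hat = s[0] * U[:, 0]             # first principal component lambda_1 u_1 = Y v_1
k_hat = Vt[0, :]                   # first right-singular vector v_1

# normalization (7.65): rescale so that sum_x b_x = 1; sum_t k_t = 0 then
# holds automatically because the rows of Y have been centered
c = b_hat.sum()
b_hat, k_hat = b_hat / c, k_hat * c

Y1 = np.outer(b_hat, k_hat)        # rank-1 approximation Y_1 of Y
```

Note that the rescaling by $c$ leaves the product $\widehat{b}_x \widehat{k}_t$, and hence the rank-1 approximation $Y_1$, unchanged.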

We fit the LC model to the Swiss mortality data of females and males separately. The raw log-mortality rates $\log(M_{x,t})$ for the years $t \in \mathcal{T} = \{1950,\dots,2016\}$ and the ages $0 \le x \le 99$ are illustrated in Fig. 7.32; both plots use the same color scale. This mortality data has been obtained from the Human Mortality Database (HMD) [195]. In general, we observe a diagonal structure that indicates mortality improvements over time.

**Fig. 7.32** Raw log-mortality rates $\log(M_{x,t})$ for the calendar years $1950 \le t \le 2016$ and the ages $x_0 = 0 \le x \le x_1 = 99$ of Swiss females (lhs) and Swiss males (rhs); both plots use the same color scale

**Fig. 7.33** LC fitted log-mortality rates $\log(\widehat{\mu}_{x,t})$ for the calendar years $1950 \le t \le 2016$ and the ages $x_0 = 0 \le x \le x_1 = 99$ of Swiss females (lhs) and Swiss males (rhs); the plots use the same color scale as Fig. 7.32

Define the fitted log-mortality surface

$$\log(\widehat{\mu}_{x,t}) = \widehat{a}_x + \widehat{b}_x \widehat{k}_t \qquad \text{for } x_0 \le x \le x_1 \text{ and } t \in \mathcal{T}.$$

Figure 7.33 shows the LC fitted log-mortality surface $(\log(\widehat{\mu}_{x,t}))_{0\le x\le 99;\, t\in\mathcal{T}}$ separately for Swiss females and Swiss males; the color scale is the same as in Fig. 7.32. The plots show a strong similarity between the raw log-mortality data and the LC fitted log-mortality surface, which clearly supports the LC model for the Swiss data. In general, the LC surface is a smoothed version of the raw log-mortality surface. The main difference in our LC fit concerns the male population for ages $20 \le x \le 40$ from 1980 to 2000; one explanation of the special pattern in the observed data during that time is the emergence of HIV.

**Fig. 7.34** (lhs) Singular values $\lambda_j$, $1 \le j \le |\mathcal{T}|$, of the SVD of the data matrix $Y \in \mathbb{R}^{n\times|\mathcal{T}|}$, and (rhs) the reconstruction errors $\|Y - Y_p\|^2_{\mathrm{F}}$ for $0 \le p \le |\mathcal{T}|$

Figure 7.34 (lhs) shows the singular values $\lambda_1 \ge \dots \ge \lambda_{|\mathcal{T}|} > 0$ for Swiss females and Swiss males. We observe that the first singular value $\lambda_1$ by far dominates the remaining singular values $\lambda_j$, $j \ge 2$. Thus, the first principal component may indeed already be sufficient, and the centered raw log-mortality data $Y$ can be described by a matrix $Y_1$ of rank $p=1$. Figure 7.34 (rhs) gives the squared Frobenius reconstruction errors of the approximations $Y_p$ of ranks $0 \le p \le |\mathcal{T}|$, where $Y_0$ corresponds to the zero matrix, i.e., we do not use any approximation but just the average observed log-mortality rate. We observe that the first singular value leads by far to the biggest decrease in the reconstruction error, and the subsequent terms $\lambda_j$, $j \ge 2$, improve it only slightly in each step. This supports the use of the LC model with a rank $p=1$ approximation to the centered raw log-mortality rates $Y$. Higher rank PCAs within mortality modeling have been studied in Renshaw–Haberman (RH) [308], and the RH$(p)$ mortality model considers the rank $p$ approximation $Y_p$ to the raw log-mortality rates $Y$ given by

$$
\log(\mu_{x,t}) = a_x + \langle \mathbf{b}_x, \mathbf{k}_t \rangle,
$$

for $\mathbf{b}_x, \mathbf{k}_t \in \mathbb{R}^p$.

We have (only) fitted a mortality surface to the raw log-mortality rates on the rectangle $\{x_0,\dots,x_1\} \times \mathcal{T}$. This does not allow us to forecast mortality into the future. Forecasting requires a two-step procedure which, after this first estimation step, extrapolates the time index (time-series) $(\widehat{k}_t)_{t\in\mathcal{T}}$ beyond the latest observation point in $\mathcal{T}$. The simplest (meaningful) model for this second (extrapolation) step is a random walk with drift for the time index process $(k_t)_{t\ge 0}$. Figure 7.35 shows the estimated two-dimensional process $(\widehat{\mathbf{k}}_t)_{t\in\mathcal{T}}$, i.e., for $p=2$, on the rectangle $\{x_0,\dots,x_1\} \times \mathcal{T}$, which needs to be extrapolated to predict within the RH ($p=2$) mortality model. We refrain from doing this step here; extrapolation will be studied in Sect. 8.4, below.

**Fig. 7.35** Estimated two-dimensional processes $(\widehat{\mathbf{k}}_t)_{t\in\mathcal{T}}$ for Swiss females (lhs) and Swiss males (rhs); these are normalized such that they are centered and such that the components of $\widehat{\mathbf{b}}_x$ add up to 1
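Although the extrapolation step is deferred to Sect. 8.4, the random-walk-with-drift fit itself is elementary; a minimal numpy sketch on a synthetic one-dimensional time index (the data and all names are our own, and the drift estimator is the standard mean-increment estimator):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 67
k = np.cumsum(-1.5 + 0.5 * rng.standard_normal(T))   # synthetic estimated time index (\hat k_t)

# random walk with drift: k_t = k_{t-1} + delta + eps_t, eps_t i.i.d. centered
increments = np.diff(k)
delta_hat = increments.mean()                        # estimated drift
sigma_hat = increments.std(ddof=1)                   # estimated innovation standard deviation

# central forecast h steps ahead: k_{T+h} = k_T + h * delta_hat
h = np.arange(1, 11)
k_forecast = k[-1] + h * delta_hat
```

Prediction uncertainty grows like $\sqrt{h}\,\widehat{\sigma}$ around this central path, which is what makes long-horizon mortality forecasts so sensitive to the drift estimate.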

## *7.5.5 Bottleneck Neural Network*

BN networks have become popular for studying non-linear generalizations of the PCA; we refer to Kramer [225] and Hinton–Salakhutdinov [186]. The BN network architecture is such that (1) the input dimension $q_0$ is equal to the output dimension $q_{d+1}$ of a FN network, and (2) in between there is a FN layer $1 \le m \le d$ that has a very low dimension $q_m \ll q_0$, called the bottleneck. Figure 7.36 (lhs) shows such a BN network of depth $d = 3$ and neurons

$$(q\_0, q\_1, q\_2, q\_3, q\_4) = (20, 7, 2, 7, 20).$$

The input and output neurons are colored blue, and the bottleneck of dimension $q_2 = 2$ is shown in red in Fig. 7.36 (lhs).

**Fig. 7.36** (lhs) BN network of depth *d* = 3 with *(q*0*, q*1*, q*2*, q*3*, q*4*)* = *(*20*,* 7*,* 2*,* 7*,* 20*)*, (middle and rhs) shallow BN networks with a bottleneck of dimensions 7 and 2, respectively
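To make this architecture concrete, here is a minimal numpy forward pass of such a BN auto-encoder with the dimensions of Fig. 7.36 (lhs); the random weights merely stand in for fitted ones, and all names are our own (the book's implementations use R with keras):

```python
import numpy as np

rng = np.random.default_rng(6)
dims = (20, 7, 2, 7, 20)           # (q_0, q_1, q_2, q_3, q_4), bottleneck q_2 = 2
# random weights W^(m) in R^{q_m x q_{m-1}}, no intercepts
W = [0.3 * rng.standard_normal((dims[m + 1], dims[m])) for m in range(4)]

def z(m, x):
    """FN layer z^(m): tanh activation in the hidden layers, identity in the output."""
    a = W[m - 1] @ x
    return a if m == len(W) else np.tanh(a)

def encoder(y):                    # Phi = z^(2) o z^(1), cf. (7.66)
    return z(2, z(1, y))

def decoder(x):                    # Psi = z^(4) o z^(3)
    return z(4, z(3, x))

y = rng.standard_normal(20)
reconstruction = decoder(encoder(y))   # Psi o Phi(y) in R^20
```

Note how the bounded tanh range already hints at the extrapolation issue discussed at the end of this section: the bottleneck activations can never leave $(-1,1)^{q_2}$.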

The motivation is as follows. Assume we have a given dissimilarity function $L(\cdot,\cdot): \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}_+$ that measures the reconstruction error of an auto-encoder $\Psi \circ \Phi(\mathbf{y}) \in \mathbb{R}^q$ relative to the original input $\mathbf{y} \in \mathbb{R}^q$, see (7.55). We try to find a BN network with input and output dimensions $q_0 = q_{d+1} = q$ (we drop the intercepts in the entire construction) and a bottleneck in layer $m$ having a low dimension $q_m$, such that the BN network provides a small reconstruction error. Choose a FN network

$$\mathbf{y} \in \mathbb{R}^q \mapsto \Psi \circ \Phi(\mathbf{y}) = \mathbf{z}^{(d+1:1)}(\mathbf{y}) = \left(\mathbf{z}^{(d+1)} \circ \mathbf{z}^{(d)} \circ \cdots \circ \mathbf{z}^{(1)}\right)(\mathbf{y}) \in \mathbb{R}^q,$$

with FN layers for 1 ≤ *m* ≤ *d* (excluding intercepts)

$$\mathbf{z}^{(m)}: \mathbb{R}^{q_{m-1}} \to \mathbb{R}^{q_m}, \qquad \mathbf{z} \mapsto \mathbf{z}^{(m)}(\mathbf{z}) = \left(\phi\langle\mathbf{w}_1^{(m)}, \mathbf{z}\rangle, \dots, \phi\langle\mathbf{w}_{q_m}^{(m)}, \mathbf{z}\rangle\right)^{\top},$$

and having network weights $\mathbf{w}_j^{(m)} \in \mathbb{R}^{q_{m-1}}$, $1 \le j \le q_m$. For the output we choose the identity function as activation function

$$\mathbf{z}^{(d+1)}: \mathbb{R}^{q\_d} \to \mathbb{R}^{q\_{d+1}}, \qquad \mathbf{z} \mapsto \mathbf{z}^{(d+1)}(\mathbf{z}) = \left( \langle \mathbf{w}\_1^{(d+1)}, \mathbf{z} \rangle, \dots, \langle \mathbf{w}\_{q\_{d+1}}^{(d+1)}, \mathbf{z} \rangle \right)^{\top},$$

and having network weights $\mathbf{w}_j^{(d+1)} \in \mathbb{R}^{q_d}$, $1 \le j \le q_{d+1}$. The resulting network parameter $\vartheta$ is now fitted to the data matrix $Y = (\mathbf{y}_1,\dots,\mathbf{y}_n)^{\top} \in \mathbb{R}^{n\times q}$ such that the reconstruction error is minimized over all instances

$$\widehat{\vartheta} = \operatorname*{arg\,min}_{\vartheta} \sum_{i=1}^{n} L\left(\mathbf{y}_i, \Psi \circ \Phi(\mathbf{y}_i)\right) = \operatorname*{arg\,min}_{\vartheta} \sum_{i=1}^{n} L\left(\mathbf{y}_i, \mathbf{z}^{(d+1:1)}(\mathbf{y}_i)\right).$$

We use this fitted network parameter $\widehat{\vartheta}$ and denote the resulting FN layers by $\widehat{\mathbf{z}}^{(m)}$ for $1 \le m \le d+1$.

This allows us to define the BN encoder, set *q* = *q*<sup>0</sup> and *p* = *qm*,

$$\Phi: \mathbb{R}^{q\_0} \to \mathbb{R}^{q\_m}, \qquad \mathbf{y} \mapsto \Phi(\mathbf{y}) = \widehat{\mathbf{z}}^{(m;1)}(\mathbf{y}) = \left(\widehat{\mathbf{z}}^{(m)} \circ \cdots \circ \widehat{\mathbf{z}}^{(1)}\right)(\mathbf{y}), \tag{7.66}$$

and the BN decoder is given by, set *qm* = *p* and *qd*+<sup>1</sup> = *q*,

$$\Psi: \mathbb{R}^{q_m} \to \mathbb{R}^{q_{d+1}}, \qquad \mathbf{z} \mapsto \Psi(\mathbf{z}) = \widehat{\mathbf{z}}^{(d+1:m+1)}(\mathbf{z}) = \left(\widehat{\mathbf{z}}^{(d+1)} \circ \cdots \circ \widehat{\mathbf{z}}^{(m+1)}\right)(\mathbf{z}).$$

The BN encoder (7.66) gives us a $q_m$-dimensional representation of the data. A linear rank $p$ representation $Y_p$ of $Y$, see (7.61), can be found by a BN network architecture that has a minimal FN layer width of dimension $p = \min_{1\le j\le d} q_j$, and with the identity activation function $\phi(x) = x$. Such a BN network is a linear map of maximal rank $p$. Using the squared Euclidean distance as dissimilarity measure provides us with an optimal network parameter $\widehat{\vartheta}$ for this linear map such that we receive the rank $p$ approximation $Y_p$ from applying $\widehat{\mathbf{z}}^{(d+1:1)}$ to the instances (rows) of $Y$. There is one point to be considered here, namely, why the bottleneck activations $\Phi(\mathbf{y}) = \widehat{\mathbf{z}}^{(m:1)}(\mathbf{y}) \in \mathbb{R}^p$ in the linear activation case are not directly comparable to the principal components $(\langle\mathbf{y}, \mathbf{v}_1\rangle, \dots, \langle\mathbf{y}, \mathbf{v}_p\rangle)$ of the PCA. Namely, the PCA uses an orthonormal basis $\mathbf{v}_1,\dots,\mathbf{v}_p$, whereas the linear BN network may use any basis of the same $p$-dimensional subspace, i.e., to directly bring these two representations in line we still need a coordinate transformation of the bottleneck activations.
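This identifiability issue is easy to verify numerically: composing the PCA basis $V_p$ with any invertible $p\times p$ matrix $A$ in the encoder, and with $A^{-1}$ in the decoder, yields exactly the same rank-$p$ reconstruction while changing the bottleneck activations. A small numpy check (our own construction, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.standard_normal((50, 10))
p = 2

U, lam, Vt = np.linalg.svd(Y, full_matrices=False)
V_p = Vt[:p, :].T                       # orthonormal basis v_1, ..., v_p

# PCA auto-encoder: encode y -> V_p^T y, decode z -> V_p z
Y_pca = (Y @ V_p) @ V_p.T               # optimal rank-p reconstruction

# linear BN network: the same subspace in a non-orthonormal coordinate system
A = np.array([[2.0, 1.0], [0.0, 3.0]])  # any invertible p x p matrix
encoder = V_p @ A                       # bottleneck activations are Y @ encoder
decoder = np.linalg.inv(A) @ V_p.T      # maps the bottleneck back to R^q
Y_bn = (Y @ encoder) @ decoder          # identical reconstruction
```

Both maps have identical reconstruction error, yet the bottleneck activations differ, which is exactly why a fitted linear BN network only recovers the principal components up to an invertible coordinate transformation.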

Hinton–Salakhutdinov [186] noticed that the gradient descent fitting of a BN network needs some care, otherwise we may find a local minimum of the loss function that has a poor reconstruction performance. In order to implement a more sophisticated way of SGD fitting we require that the depth $d$ of the network is an odd number and that the network architecture is symmetric around the central FN layer $(d+1)/2$. This is the case in Fig. 7.36 (lhs). Fitting of this network of depth $d = 3$ is now done in three steps:


1. In a first step we fit a shallow BN network with bottleneck dimension $q_1$ to the data $Y$, see Fig. 7.36 (middle). This gives us the preliminary estimates for the network weights $\mathbf{w}_1^{(1)},\dots,\mathbf{w}_{q_1}^{(1)}$ and $\mathbf{w}_1^{(4)},\dots,\mathbf{w}_{q_4}^{(4)}$ of the full BN network.

2. In a second step we encode the data with the first-step network and fit a shallow BN network with bottleneck dimension $q_2$ to the resulting $q_1$-dimensional activations, see Fig. 7.36 (rhs). This second step gives us the preliminary estimates for the network weights $\mathbf{w}_1^{(2)},\dots,\mathbf{w}_{q_2}^{(2)}$ and $\mathbf{w}_1^{(3)},\dots,\mathbf{w}_{q_3}^{(3)}$ of the full BN network.

3. In the final step we fit the full BN network on the data *Y* and use the preliminary estimates of the weights (of the previous two steps) as initialization of the gradient descent algorithm.

*Example 7.31 (BN Network Mortality Model)* We apply this BN network approach to modify the LC model of Sect. 7.5.4. Hainaut [178] considered such a BN network application. For computational reasons, Hainaut [178] proposed a calibration strategy different from that of Hinton–Salakhutdinov [186]. We use the latter (Hinton–Salakhutdinov) calibration strategy as it has turned out to work well in our setting.

As the BN network architecture we choose a FN network of depth $d = 3$. The input and output dimensions are equal to $q_0 = q_4 = 67$; this exactly corresponds to the number of available calendar years $1950 \le t \le 2016$, see Fig. 7.32. Then, we select a symmetric architecture around the central FN layer $m = 2$ with $q_1 = q_3 = 20$ neurons. That is, in a first step, the 67 calendar years are compressed to a 20-dimensional representation. For the bottleneck we then explore different numbers of neurons $q_2 = p \in \{1,\dots,20\}$. These BN networks are implemented and fitted in R with the library keras [77]. We have fitted these models separately to the Swiss female and male populations. The raw log-mortality rates are illustrated in Fig. 7.32; for comparability with the LC approach we have centered these log-mortality rates according to (7.64), and we use the squared Euclidean distance as the objective function.

Figure 7.37 compares the squared Frobenius reconstruction errors of the linear LC approximations $Y_p$ to their non-linear BN network counterparts with bottlenecks $q_2 = p$. We observe that the BN figures are clearly smaller, which says that a non-linear auto-encoding provides a better reconstruction; this holds true, in particular, for $2 \le q_2 < 20$. For $q_2 \ge 20$ the learning with the BN networks seems saturated; note that the outer layers have $q_1 = q_3 = 20$ neurons, which limits the learning at the bottleneck for bigger $q_2$. In view of Fig. 7.37 there seems to be a kink at $q_2 = 4$, and an "elbow" criterion says that this is the critical bottleneck size that should not be exceeded.

**Fig. 7.37** Frobenius reconstruction errors $\|Y - Y_p\|^2_{\mathrm{F}}$ for $1 \le p = q_2 \le 20$ in the linear LC approach and the non-linear BN approach

**Fig. 7.38** BN network $(q_1, q_2, q_3) = (20, 2, 20)$ fitted log-mortality rates $\log(\widehat{\mu}_{x,t})$ for the calendar years $1950 \le t \le 2016$ and the ages $x_0 = 0 \le x \le x_1 = 99$ of Swiss females (lhs) and Swiss males (rhs); the plots use the same color scale as Fig. 7.32

The resulting estimated log-mortality surfaces for the bottleneck $q_2 = 2$ are illustrated in Fig. 7.38. These strongly resemble the raw log-mortality rates in Fig. 7.32; in particular, for the male population we get a better fit for ages $20 \le x \le 40$ from 1980 to 2000 compared to the LC model. In a further analysis we should check whether this BN network over-fits to the data. We could, e.g., explore drop-outs during calibration or smaller FN (compression) layers $q_1 = q_3$.

Finally, we analyze the resulting activations at the bottleneck by considering the BN encoder (7.66). Note that we assume $\mathbf{y} \in \mathbb{R}^q$ in (7.66) with $q = |\mathcal{T}|$ being the rank of the data matrix $Y \in \mathbb{R}^{n\times q}$. Thus, the encoder takes a fixed age $0 \le x \le 99$ and encodes the corresponding time-series observation $\mathbf{y}_x \in \mathbb{R}^{|\mathcal{T}|}$ by the bottleneck activations. This parametrization has been inspired by the PCA, which typically considers a data matrix that has more rows than columns. This results in at most $q = \operatorname{rank}(Y)$ singular values, provided $n \ge q$. However, we can easily exchange the roles of rows and columns, e.g., by transposing all matrices involved. For mortality forecasting it is advantageous to exchange these roles because we would like to extrapolate a time-series beyond $\mathcal{T}$. For this reason we set the input dimension to $q_0 = q = 100$, which provides us with $|\mathcal{T}|$ observations $\mathbf{y}_t \in \mathbb{R}^{100}$. We then fit the BN encoder (7.66) to receive the bottleneck activations

$$Y = (\mathbf{y}_t)_{t \in \mathcal{T}} \mapsto \Phi(Y) = (\Phi(\mathbf{y}_t))_{t \in \mathcal{T}} \in \mathbb{R}^{q_2 \times |\mathcal{T}|}.$$

Figure 7.39 shows these figures for a bottleneck $q_2 = 2$. We observe that these bottleneck time-series $(\Phi(\mathbf{y}_t))_{t\in\mathcal{T}}$ are much more difficult to understand than the LC/RH ones given in Fig. 7.35. Firstly, we see that there is quite some dependence between the components of the time-series. Secondly, in contrast to the LC/RH case of Fig. 7.35, there is not one component that dominates. Note that this dominance has been obtained by scaling the components of $(\widehat{b}_x)_x$ to add up to 1 (which, of course, reflects the magnitudes of the singular values). In the non-linear case, these scales are hidden in the decoder and are more difficult to extract. Thirdly, the extrapolation may not work if the time-series has a trend and if we use the hyperbolic tangent activation function, which has a bounded range. In general, a trend extrapolation has to be considered very carefully with FN networks with non-linear activation functions, and often there is no good solution to this problem within the FN network framework. We conclude that this approach improves the in-sample mortality surface modeling, but it leaves open the question of forecasting future mortality rates because an extrapolation seems more difficult.

**Fig. 7.39** BN network $(q_1, q_2, q_3) = (20, 2, 20)$: bottleneck activations showing $\Phi(\mathbf{y}_t) \in \mathbb{R}^2$ for $t \in \mathcal{T}$

*Remark 7.32* The concept of BN networks has also been considered in the actuarial literature to encode geographic information, see Blier-Wong et al. [39]. Since geographic information has a natural spatial component, these authors propose to use a convolutional neural network to encode the spatial information before processing the learned features through a BN network. The proposed decoder may have different forms: either it tries to reconstruct the whole (spatial) neighborhood of a given location, or it only tries to reconstruct the site of a given location.

## **7.6 Model-Agnostic Tools**

We collect some model-agnostic tools in this section that help us to better understand and analyze the networks, their calibrations and predictions. Model-agnostic tools are techniques that are not specific to a certain model type and can be used for any regression model. Most methods considered here are nicely presented in the tutorial of Lorentzen–Mayer [258]. There are several ways of getting a better understanding of a regression model. First, we can analyze variable importance, which tries to answer questions similar to the GLM variable selection tools of Sect. 5.3 on model validation. However, in general, we cannot rely on any asymptotic likelihood theory for such an analysis. Second, we can try to understand the predictive model. For a GLM with the log-link function this is quite simple because the systematic effects are of a multiplicative nature. For networks this is much more complicated because we allow for much more general regression functions. We can either try to understand these functions on a global portfolio level (by averaging the effects over many insurance policies) or we can try to understand these functions locally for individual insurance policies. The latter refers to local sensitivities around a chosen feature value $\mathbf{x} \in \mathcal{X}$, and the former to global model-agnostics.

## *7.6.1 Variable Permutation Importance*

For GLMs we have studied the LRT and the Wald test that have been assisting us in reducing the GLM by the feature components that do not contribute sufficiently to the regression task at hand, see Sects. 5.3.2 and 5.3.3. These variable reduction techniques rely on an asymptotic likelihood theory. Here, we need to proceed differently, and we just aim at ranking the variables by their importance, similarly to a drop1 analysis, see Listing 5.6.

For a given FN network regression model

$$
\mathbf{x} \in \mathcal{X} \mapsto \mu(\mathbf{x}) = g^{-1}\langle \boldsymbol{\beta}, \mathbf{z}^{(d:1)}(\mathbf{x}) \rangle,
$$

we randomize one component of $\mathbf{x} = (x_1,\dots,x_q)$ at a time, and we study the resulting change in the objective function. More precisely, for given (learning) data $\mathcal{L}$, with features $\mathbf{x}_1,\dots,\mathbf{x}_n$, we select one feature component $1 \le j \le q$ and permute $(x_{i,j})_{1\le i\le n}$ randomly across the entire portfolio $1 \le i \le n$. We denote by $\mathcal{L}^{(j)}$ the resulting data with the $j$-th component being permuted. We then compare the resulting deviance loss $\mathfrak{D}(\mathcal{L}^{(j)}, \mu)$ to the one $\mathfrak{D}(\mathcal{L}, \mu)$ on the original data $\mathcal{L}$ using the same regression model $\mu$. We call this approach variable permutation importance (VPI). Note that such a permutation does not only act on the marginal effects, but it also distorts the interaction effects of the different feature components.

We calculate the VPI on the MTPL claim frequency data of model Poisson GLM3 of Table 5.5 and the FN network regression model $\widehat{\mu}_{m=1}$ of Table 7.9; we use this example throughout this section on model-agnostic tools. Figure 7.40 shows the relative increases

$$\mathrm{vpi}^{(j)} = \frac{\mathfrak{D}(\mathcal{L}^{(j)}, \mu) - \mathfrak{D}(\mathcal{L}, \mu)}{\mathfrak{D}(\mathcal{L}, \mu)}$$

of the deviance losses by permuting one feature component 1 ≤ *j* ≤ *q* at a time.

Obviously, the BonusMalus level followed by DrivAge and VehBrand are the most important variables according to this VPI method; the two models agree on this ranking. Thereafter, there are smaller disagreements between the two models. These disagreements may (also) be caused by a non-optimal feature pre-processing in the GLM where, for instance, we have to add the interaction effects manually, see (5.35). Overall, these VPI results are in line with the findings of the classical methods on GLMs, see for instance the drop1 table in Listing 5.6.

One point that is worth mentioning (and which makes the VPI results not fully reliable) is the use of feature components that are highly correlated. In our case, Density and Area are highly correlated, see Fig. 13.12. Therefore, it may not make sense to randomly permute one component while keeping the other one unchanged. This issue will also arise in other methods described below.
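The VPI recipe above is a few lines for any fitted model $\mu$; the following numpy sketch uses the Poisson deviance and a toy frequency model (the data, model and coefficients are our own stand-ins, not the MTPL example):

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 5000, 3
X = rng.standard_normal((n, q))
# simulated claim counts; only the first two components carry signal
y = rng.poisson(np.exp(-2.0 + 0.8 * X[:, 0] + 0.3 * X[:, 1]))

def mu(X):
    """Stand-in for a fitted frequency model (here: the true mean function)."""
    return np.exp(-2.0 + 0.8 * X[:, 0] + 0.3 * X[:, 1])

def poisson_deviance(y, m):
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / m), 0.0)
    return 2.0 * np.sum(term - y + m)

D0 = poisson_deviance(y, mu(X))
vpi = []
for j in range(q):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])   # permute component j across the portfolio
    vpi.append((poisson_deviance(y, mu(Xp)) - D0) / D0)
```

Permuting the strongest component inflates the deviance the most, while permuting the irrelevant third component leaves it unchanged, reproducing the ranking logic of Fig. 7.40.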

*Remark 7.33 (Global Surrogate Model)* There are other machine learning methods that offer different measures of variable importance. For instance, (binary split) classification and regression trees (CARTs) offer popular methods for measuring variable importance; for binary split CARTs we refer to Breiman et al. [54] and Denuit et al. [100]. These CARTs select individual feature components for partitioning the feature space *X*, and variable importance is measured by analyzing the contribution of each feature component to the total decrease of the objective function. Binary split CARTs have the advantage that this can be done in an additive way.

More complex regression models like FN networks can then be analyzed by using a binary split regression tree as a global surrogate model. That is, we can fit a CART to the network regression function (as a surrogate model) and then analyze variable importance in this surrogate regression tree model using the tools of regression trees. We will not give an explicit example here because we have not formally introduced regression trees in this manuscript, but this concept is fairly straightforward and well-understood.

## *7.6.2 Partial Dependence Plots*

There are several graphical tools that study the individual behavior in the feature components. Some of these tools select individual insurance policies and others study global portfolio properties. They have in common that they are based on marginal considerations, i.e., some sort of projection.

#### **Individual Conditional Expectation**

Individual conditional expectation (ICE) selects individual insurance policies $(Y_i, \mathbf{x}_i, v_i)$ and varies the feature components of $\mathbf{x}_i$ over their entire domain; we refer to Goldstein et al. [164]. Similarly to the VPI of Sect. 7.6.1, ICE does not respect collinearity between feature components; rather, it is an isolated view of individual components.

In Fig. 7.41 we provide the ICE plots of model Poisson GLM3 of Table 5.5 and the FN network regression model $\widehat{\mu}_{m=1}$ of Table 7.9 for 100 randomly selected insurance policies $\mathbf{x}_i$. For these randomly selected insurance policies we let the variable DrivAge vary over its domain $\{18,\dots,90\}$. Each color corresponds to one insurance policy $i$, and the colors in the two plots coincide. In the GLM we observe that the lines are roughly parallel, which reflects that we have an additive regression structure on the canonical scale (note that these plots are on the canonical parameter scale). The lines are not perfectly parallel because we allow for an interaction between DrivAge and BonusMalus in model Poisson GLM3, see (5.35). The plot of the FN network is more difficult to interpret. Overall the levels (colors) coincide in the two plots, but in the FN network plot the lines are not increasing for ages approaching 18; the reason for this is that we have interactions with other feature components that are important. In particular, for ages close to 18 we cannot have a BonusMalus level of 50% and, therefore, the FN network cannot be trained on this part of the feature space. Nevertheless, the ICE plot allows for such feature configurations (by just extrapolating the FN network regression function beyond the set of available insurance policies). This difficulty is confirmed


**Fig. 7.41** ICE plots of 100 randomly selected insurance policies $\mathbf{x}_i$ for (lhs) model Poisson GLM3 and (rhs) the FN network $\widehat{\mu}_{m=1}$, letting the variable DrivAge vary over its domain; the $y$-axis is on the canonical parameter scale

by exploiting the same plot only on insurance policies that have a BonusMalus level of at least 100%. In that case the lines for small ages are non-decreasing when approaching the age of 18, thus providing a more reasonable interpretation. We conclude that if we have strong dependence and/or interactions between the feature components, this method may not provide any reasonable interpretation.
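An ICE profile simply re-evaluates the model on copies of one policy in which a single component sweeps its domain; a minimal numpy sketch (the toy model and its coefficients are our own, not the fitted MTPL models):

```python
import numpy as np

def ice_profile(model, x, j, grid):
    """ICE curve for one policy x: vary component j over grid, keep the rest fixed."""
    X = np.tile(x, (len(grid), 1))
    X[:, j] = grid
    return model(X)

# toy frequency model with a DrivAge-like component in column 0 (our own stand-in)
model = lambda X: np.exp(-2.0 - 0.02 * X[:, 0] + 0.3 * X[:, 1])

grid = np.arange(18, 91)                       # DrivAge domain {18, ..., 90}
curve = ice_profile(model, np.array([45.0, 1.0]), j=0, grid=grid)
```

Plotting one such curve per sampled policy, on the canonical parameter scale, yields exactly the spaghetti-type plots of Fig. 7.41.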

#### **Partial Dependence Plot**

Partial dependence plots (PDPs) have been introduced by Friedman [141], see also Zhao–Hastie [405]. PDPs are closely related to the do-operator in causal inference in statistics; we refer to Pearl [298] and Pearl et al. [299] for the do-operator. A PDP and the do-operator, respectively, are obtained by breaking the dependence structure between different feature components. Namely, we decompose the feature $\mathbf{x} = (x_j, \mathbf{x}_{\setminus j})$ into two parts, with $\mathbf{x}_{\setminus j}$ denoting all feature components except component $x_j$; we use a slight abuse of notation because the components need to be permuted correspondingly in the regression function $\mathbf{x} \mapsto \mu(\mathbf{x}) = \mu(x_j, \mathbf{x}_{\setminus j})$. Since, typically, there is dependence between $x_j$ and $\mathbf{x}_{\setminus j}$, one can infer $\mathbf{x}_{\setminus j}$ from $x_j$, and vice versa. A PDP breaks this inference potential so that the sensitivity can be studied purely in $x_j$. In particular, the partial dependence profile is obtained by

$$
x_j \mapsto \bar{\mu}^{j}(x_j) = \int \mu(x_j, \mathbf{x}_{\setminus j}) \, dp(\mathbf{x}_{\setminus j}), \tag{7.67}
$$

where $p(\mathbf{x}_{\setminus j})$ is the marginal (portfolio) distribution of the feature components $\mathbf{x}_{\setminus j}$. Observe that this differs from the conditional expectation, which reads as

$$x_j \mapsto \mu(x_j) = \mathbb{E}_p\left[\mu(x_j, \mathbf{x}_{\setminus j}) \,\middle|\, x_j\right] = \int \mu(x_j, \mathbf{x}_{\setminus j}) \, dp(\mathbf{x}_{\setminus j} | x_j),$$

the latter allowing for inferring $\mathbf{x}_{\setminus j}$ from $x_j$ through the conditional probability $dp(\mathbf{x}_{\setminus j}|x_j)$.

*Remark 7.34 (Discrimination-Free Insurance Pricing)* Recent actuarial literature discusses discrimination-free insurance pricing which aims at developing a pricing framework that is free of discrimination w.r.t. so-called protected characteristics such as gender and ethnicity; we refer to Guillén [174], Chen et al. [69, 70], Lindholm et al. [253] and Frees–Huang [136] for discussions on discrimination in insurance. In general, part of the problem also lies in the fact that one can often infer the protected characteristics from the non-protected feature information. This is called indirect discrimination or proxy discrimination. The proposal of Lindholm et al. [253] for achieving discrimination-free prices exactly follows the construction (7.67), by breaking the link, which infers the protected characteristics from the non-protected ones.

The partial dependence profile on our portfolio $\mathcal{L}$ with given features $\boldsymbol{x}_1, \dots, \boldsymbol{x}_n$ is now obtained by using the portfolio distribution as an empirical distribution for $p$ in (7.67). That is, for a selected component $x_j$ of $\boldsymbol{x}$, we consider the partial dependence profile

$$x_j \mapsto \bar{\mu}^j(x_j) = \frac{1}{n} \sum_{i=1}^n \mu(x_j, \boldsymbol{x}_{i, \setminus j}) = \frac{1}{n} \sum_{i=1}^n \mu\left(x_{i,0}, x_{i,1}, \dots, x_{i,j-1}, x_j, x_{i,j+1}, \dots, x_{i,q}\right),$$

thus, we average the ICE plots over the $\boldsymbol{x}_{i,\setminus j}$ of our portfolio, $1 \le i \le n$.
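This empirical averaging can be sketched in a few lines. The regression function `mu` and the simulated features below are purely hypothetical stand-ins for the fitted model and the MTPL portfolio; Python is used here (rather than the R of the chapter's listings) only for a compact self-contained illustration:

```python
import numpy as np

def partial_dependence(mu, X, j, grid):
    """Empirical PDP: average mu over the portfolio, with component j pinned to each grid value."""
    pdp = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                      # pin component j, keep x_{i,\j} as observed
        pdp.append(mu(Xv).mean())         # average of the ICE profiles at this grid value
    return np.array(pdp)

# hypothetical toy portfolio and regression function (stand-ins for the fitted network)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
mu = lambda Z: np.exp(-2.0 + 0.3 * Z[:, 0] + 0.1 * Z[:, 1] * Z[:, 2])

grid = np.linspace(-2.0, 2.0, 5)
pdp = partial_dependence(mu, X, j=0, grid=grid)   # non-decreasing, as mu increases in x_0
```

Note that, exactly as discussed below, the profile evaluates `mu` on feature configurations `(v, x_{i,\j})` that may never occur jointly in the portfolio.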

Figure 7.42 (lhs, middle) gives the PDPs of the variables BonusMalus and DrivAge of model Poisson GLM3 and the FN network $\mu_{m=1}$. Overall they

**Fig. 7.42** PDPs of (lhs) BonusMalus level and (middle) DrivAge; the *y*-axis is on the canonical parameter scale; (rhs) ratio of policies with a bonus-malus level of 50% per driver's age

look reasonable. However, we are again facing the difficulty that these partial dependence profiles consider feature configurations that should not appear in our portfolio. Roughly 57% of all insurance policies have a bonus-malus level of 50%, which means that these drivers did not suffer any claims in the past couple of years. Obviously, a driver of age 18 cannot be on this bonus-malus level, simply because she/he cannot yet have multiple years of driving experience without an accident. However, the PDP does not respect this fact and just extrapolates the regression function into that part of the feature space. Therefore, the PDP at driver's age 18 is based on 57% of the insurance policies being on a bonus-malus level of 50%, because this corresponds to the empirical portfolio distribution $p(\boldsymbol{x}_{\setminus j})$ excluding the driver's age information $x_j = {}$DrivAge. Figure 7.42 (rhs) shows the ratio of insurance policies that have a bonus-malus level of 50%. We observe that this ratio is roughly zero up to age 28 (orange vertical dotted line), which indicates that a driver needs 10 successive accident-free years to reach the lowest bonus-malus level (starting from 100%). We consider it to be a data error that this ratio is not identically equal to zero below age 28. We conclude that these PDPs need to be interpreted very carefully because the insurance portfolio is not uniformly distributed across the feature space. In some parts of the feature space the regression function $\boldsymbol{x} \mapsto \mu(\boldsymbol{x})$ may not even be well-defined because certain combinations of feature values $\boldsymbol{x}$ may not exist (e.g., a driver of age 18 on bonus-malus level 50% or a boy at a girl's college).

#### **Accumulated Local Effects Profile**

PDPs have the problem that they do not respect the dependencies between the feature components, as explained in the previous paragraphs. The accumulated local effects (ALE) profile tries to account for these dependencies by only studying a local feature perturbation; we refer to Apley–Zhu [13]. We present a smooth (gradient-based) version of ALE because our regression functions are differentiable. Consider the local effect of the individual feature $\boldsymbol{x}$ w.r.t. the component $x_j$ by studying the partial derivative

$$
\mu\_j(\mathbf{x}) = \frac{\partial \mu(\mathbf{x})}{\partial x\_j}.\tag{7.68}
$$

The average local effect of component $j$ is obtained by

$$
x_j \mapsto \Delta_j(x_j; \mu) = \int \mu_j(x_j, \boldsymbol{x}_{\setminus j}) \, dp(\boldsymbol{x}_{\setminus j} | x_j). \tag{7.69}
$$

The ALE profile integrates the average local effects $\Delta_j(\cdot)$ over their domain; it is defined by

$$x_j \mapsto \int_{x_{j_0}}^{x_j} \Delta_j(z_j; \mu) \, dz_j = \int_{x_{j_0}}^{x_j} \int \mu_j(z_j, \boldsymbol{x}_{\setminus j}) \, dp(\boldsymbol{x}_{\setminus j} | z_j) \, dz_j, \tag{7.70}$$

where $x_{j_0}$ is a given initialization point. The difference between PDPs and ALE profiles is that the latter correctly consider the dependence structure between $x_j$ and $\boldsymbol{x}_{\setminus j}$, see (7.69).

**Listing 7.10** Local effects through the gradients of FN networks in keras [77]

```
1 Input = layer_input(shape = c(11), dtype = 'float32', name = 'Design')
2 #
3 Output = Input %>%
4 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
5 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
6 layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
7 layer_dense(units=1, activation='linear', name='Network')
8 #
9 model = keras_model(inputs = c(Input), outputs = c(Output))
10 #
11 grad = Output %>%
12 layer_lambda(function(x) k_gradients(model$outputs, model$inputs))
13 model.grad = keras_model(inputs = c(Input), outputs = c(grad))
14 theta.grad <- data.frame(model.grad %>% predict(XX))
```
*Example* We come back to our MTPL claim frequency FN network example. The local effects (7.68) can directly be calculated in the R library keras [77] for a FN network, see Listing 7.10. In order to do so we need to drop the embedding layers, compared to Listing 7.4, and work directly on the learned embeddings. This gives an input layer of dimension $q = 7 + 2 + 2 = 11$ because we have two categorical features that have been embedded into 2-dimensional Euclidean spaces $\mathbb{R}^2$. Then, we can formally calculate the gradient of the FN network w.r.t. its inputs, which is done on lines 11–13 of Listing 7.10. Remark that we work on the canonical scale because we use the linear activation function on line 7 of the listing.
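As a cross-check of what the gradient on lines 11–13 computes, the input gradient of a toy one-hidden-layer tanh network can be written out by hand via the chain rule. The weights below are random placeholders, not the fitted model, and the sketch is in Python rather than R for a compact self-contained illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
q, q1 = 11, 20                                  # input dimension as above, one hidden layer
W1, b1 = rng.normal(size=(q1, q)) / np.sqrt(q), rng.normal(size=q1)
w2 = rng.normal(size=q1) / np.sqrt(q1)

def theta(x):
    """Canonical parameter: tanh hidden layer followed by a linear output."""
    return w2 @ np.tanh(W1 @ x + b1)

def theta_grad(x):
    """Analytic input gradient via the chain rule, using tanh'(u) = 1 - tanh(u)^2."""
    z = np.tanh(W1 @ x + b1)
    return (w2 * (1.0 - z ** 2)) @ W1

x = rng.normal(size=q)
g = theta_grad(x)                               # length-q gradient, as in Listing 7.10
```

A central finite difference `(theta(x + h e_j) - theta(x - h e_j)) / (2 h)` reproduces each component of `g`, which is a useful sanity check for any automatic-differentiation pipeline.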

There remain the averaging (7.69) and the integration (7.70), which can be done empirically by

$$x_j \mapsto \Delta_j(x_j; \mu) = \frac{1}{|\mathcal{E}(x_j)|} \sum_{i \in \mathcal{E}(x_j)} \mu_j(\boldsymbol{x}_i), \tag{7.71}$$

where $\mathcal{E}(x_j)$ denotes the indices $i$ of all cases $\boldsymbol{x}_i$, $1 \le i \le n$, with $x_{i,j} = x_j$, assuming discrete feature observations. Note that this empirical averaging respects the dependence within $\boldsymbol{x}$. The (uncentered) ALE profile is then obtained by aggregating these local effects, that is,

$$
x_j \mapsto \tilde{\mu}^j(x_j) = \int_{x_{j_0}}^{x_j} \Delta_j(z_j; \mu) \, dz_j,
$$

where this integration is typically understood in a discrete sense because the observed feature components $x_{i,j}$ are discrete. Often, this uncentered ALE profile is additionally translated (centered) by the portfolio average.
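A discrete (binned) version of this uncentered ALE profile can be sketched as follows; `mu` and the simulated features are again hypothetical stand-ins, and the local effects are computed by finite differences within each bin rather than by the gradient:

```python
import numpy as np

def ale_profile(mu, X, j, bins=10):
    """Uncentered discrete ALE: accumulate, over quantile bins of x_j, the average
    local effect of mu computed only on the policies that fall into each bin."""
    z = np.quantile(X[:, j], np.linspace(0.0, 1.0, bins + 1))   # bin edges from the portfolio
    idx = np.digitize(X[:, j], z[1:-1])                         # bin index 0, ..., bins-1
    effects = np.zeros(bins)
    for b in range(bins):
        Xb = X[idx == b]
        lo, hi = Xb.copy(), Xb.copy()
        lo[:, j], hi[:, j] = z[b], z[b + 1]
        effects[b] = (mu(hi) - mu(lo)).mean()                   # local effect, keeping x_{\j}
    return z[1:], np.cumsum(effects)                            # accumulated local effects

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))
mu = lambda Z: 0.5 * Z[:, 0] + 0.2 * Z[:, 1]
edges, ale = ale_profile(mu, X, j=0)    # for this linear mu: ale = 0.5 * (edges - min(x_0))
```

Because each local effect is evaluated only on the policies actually observed in the bin, the dependence between $x_j$ and $\boldsymbol{x}_{\setminus j}$ is respected, in contrast to the PDP.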

#### *Remarks 7.35*

• For a GLM with canonical parameter $\theta(\boldsymbol{x}) = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle$, the local effects on the canonical scale are constant in $\boldsymbol{x}$ and coincide with the regression coefficients,
$$
\theta_j(\boldsymbol{x}) = \frac{\partial \theta(\boldsymbol{x})}{\partial x_j} = \beta_j \equiv \Delta_j(x_j; \theta).
$$

In the case of model Poisson GLM3 presented in Sect. 5.3.4 the situation is more delicate as we model the interactions in the GLM as follows, see (5.34) and (5.35),

$$\begin{split} & (\texttt{DrivAge}, \texttt{BonusMalus}) \\ & \mapsto \beta_l \, \texttt{DrivAge} + \beta_{l+1} \log(\texttt{DrivAge}) + \sum_{j=2}^{4} \beta_{l+j} (\texttt{DrivAge})^{j} \\ & \quad + \beta_{l+5} \texttt{BonusMalus} + \beta_{l+6} \texttt{BonusMalus} \cdot \texttt{DrivAge} \\ & \quad + \beta_{l+7} \texttt{BonusMalus} \cdot \left(\texttt{DrivAge}\right)^{2}. \end{split}$$

In that case, though we work with a GLM, the resulting local effects are different if we calculate the derivatives w.r.t. DrivAge and BonusMalus, respectively, because we explicitly (manually) include non-linear effects into the GLM.

Figure 7.43 shows the ALE profiles of the variables BonusMalus and DrivAge. The shapes of these profiles can directly be compared to the PDPs of Fig. 7.42 (the scale on the $y$-axis should be ignored because it depends on the applied centering; however, we remain on the canonical scale). The main difference between these two plots can be observed for the variable DrivAge at low ages. Namely, the ALE profiles have a different shape at low ages, respecting the dependencies in the feature components by only considering real local feature configurations.

**Fig. 7.43** ALE profiles of (lhs) BonusMalus level and (rhs) DrivAge; the *y*-axis is on the log-scale

## *7.6.3 Interaction Strength*

Next we are going to discuss pairwise interaction strength. Friedman–Popescu [143] made the following proposal. Roughly speaking, there is an interaction between the two feature components *xj* and *xk* of *x* in the regression function *x* → *μ(x)* if

$$
\mu\_{j,k}(\mathbf{x}) = \frac{\partial^2 \mu(\mathbf{x})}{\partial \mathbf{x}\_j \partial \mathbf{x}\_k} \neq 0. \tag{7.72}
$$

This means that the magnitude of a change of the regression function $\mu(\boldsymbol{x})$ in $x_j$ depends on the current value of $x_k$. If there is no such interaction, we can additively decompose the regression function $\mu(\boldsymbol{x})$ into two independent terms. This then reads as $\mu(\boldsymbol{x}) = \mu^{\setminus j}(\boldsymbol{x}_{\setminus j}) + \mu^{\setminus k}(\boldsymbol{x}_{\setminus k})$. This motivation is now applied to the PDP profiles given in (7.67). We define the centered versions $x_j \mapsto \breve{\mu}^j(x_j)$ and $x_k \mapsto \breve{\mu}^k(x_k)$ of the PDP profiles by centering the PDP profiles $x_j \mapsto \bar{\mu}^j(x_j)$ and $x_k \mapsto \bar{\mu}^k(x_k)$ over the portfolio values $\boldsymbol{x}_i$, $1 \le i \le n$. Next, we consider an analogous two-dimensional version for $(x_j, x_k)$. Let $(x_j, x_k) \mapsto \breve{\mu}^{j,k}(x_j, x_k)$ be the centered version of a two-dimensional PDP profile $(x_j, x_k) \mapsto \bar{\mu}^{j,k}(x_j, x_k)$.

Friedman's $H$-statistic measures the pairwise interaction strength between the components $x_j$ and $x_k$; it is defined by

$$H_{j,k}^{2} = \frac{\sum_{i=1}^{n} \left( \breve{\mu}^{j,k}(x_{i,j}, x_{i,k}) - \breve{\mu}^{j}(x_{i,j}) - \breve{\mu}^{k}(x_{i,k}) \right)^{2}}{\sum_{i=1}^{n} \breve{\mu}^{j,k}(x_{i,j}, x_{i,k})^{2}}, \tag{7.73}$$

we refer to formula (44) in Friedman–Popescu [143]. While $H^2_{j,k}$ measures the proportion of the joint interaction effect, as we normalize by the variability of the joint effect $\sum_{i=1}^n \breve{\mu}^{j,k}(x_{i,j}, x_{i,k})^2$, sometimes also an absolute measure is considered by taking the square root of the numerator in (7.73). Of course, this can be extended to interactions of three components, etc.; we refer to Friedman–Popescu [143].

We do not give an example here because calculating Friedman's $H$-statistic can be computationally demanding if one has many feature components with many levels in FN network modeling.
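On a low-dimensional toy example, however, the computation is cheap and instructive. The following sketch implements (7.73) directly for two hypothetical regression functions on simulated features, one additive and one with a pure interaction:

```python
import numpy as np

def pdp_value(mu, X, cols, vals):
    """Empirical PDP evaluated at vals for the components cols."""
    Xv = X.copy()
    Xv[:, cols] = vals
    return mu(Xv).mean()

def h_statistic(mu, X, j, k):
    """Friedman's H^2_{j,k} from centered one- and two-dimensional PDPs, see (7.73)."""
    n = len(X)
    pj = np.array([pdp_value(mu, X, [j], X[i, [j]]) for i in range(n)])
    pk = np.array([pdp_value(mu, X, [k], X[i, [k]]) for i in range(n)])
    pjk = np.array([pdp_value(mu, X, [j, k], X[i, [j, k]]) for i in range(n)])
    pj, pk, pjk = pj - pj.mean(), pk - pk.mean(), pjk - pjk.mean()   # centering
    return ((pjk - pj - pk) ** 2).sum() / (pjk ** 2).sum()

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
additive = lambda Z: Z[:, 0] + Z[:, 1] ** 2       # no interaction between components 0 and 1
interact = lambda Z: Z[:, 0] * Z[:, 1]            # pure interaction
h_add = h_statistic(additive, X, 0, 1)            # numerically zero
h_int = h_statistic(interact, X, 0, 1)            # close to 1
```

Each PDP evaluation costs one pass over the portfolio, so the statistic scales quadratically in the sample size, which is exactly the computational burden mentioned above.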

## *7.6.4 Local Model-Agnostic Methods*

The above methods like the PDP and the ALE profile analyze the global behavior of the regression function. We briefly mention some tools that describe the local sensitivity and explanation of regression results.

Probably the most popular method is the local interpretable model-agnostic explanation (LIME) introduced by Ribeiro et al. [311]. It locally analyzes the expected response of a given feature $\boldsymbol{x}$ by perturbing $\boldsymbol{x}$. In a nutshell, the idea is to select an environment $\mathcal{E}(\boldsymbol{x}) \subset \mathcal{X}$ of a chosen feature $\boldsymbol{x}$ and to study the regression function $\boldsymbol{x}' \mapsto \mu(\boldsymbol{x}')$ in this environment $\boldsymbol{x}' \in \mathcal{E}(\boldsymbol{x})$. This is done by fitting a (much) simpler surrogate model to $\mu$ on this environment $\mathcal{E}(\boldsymbol{x})$. If the environment is small, often a linear regression model is chosen. This then allows one to interpret the regression function $\mu(\cdot)$ locally using the simpler surrogate model; if we have a high-dimensional feature space, this linear regression is complemented with LASSO regularization to only select the most important feature components.
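A minimal version of this idea, omitting the proximity kernel and the LASSO step of the full LIME procedure, fits an ordinary least squares surrogate on perturbations of $\boldsymbol{x}$; the regression function `mu` below is a hypothetical stand-in:

```python
import numpy as np

def lime_explain(mu, x, scale=0.1, m=2000, seed=0):
    """Fit a local linear surrogate to mu in a small environment of x;
    the fitted slopes are the local explanation (no kernel weights, no LASSO)."""
    rng = np.random.default_rng(seed)
    Z = x + scale * rng.normal(size=(m, len(x)))   # perturbed samples, the environment E(x)
    A = np.column_stack([np.ones(m), Z - x])       # local design matrix, centered at x
    coef, *_ = np.linalg.lstsq(A, mu(Z), rcond=None)
    return coef[1:]                                # local slopes, approx. the gradient at x

mu = lambda Z: np.exp(-2.0 + 0.4 * Z[:, 0] - 0.2 * Z[:, 1])   # hypothetical toy model
x = np.array([0.5, 1.0])
slopes = lime_explain(mu, x)   # close to the gradient mu(x) * (0.4, -0.2)
```

For a small `scale` the surrogate slopes approach the gradient of `mu` at `x`, illustrating why local surrogate explanations and gradient-based sensitivities often agree for smooth regression functions.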

The second method considered in the literature is the Shapley additive explanation (SHAP). SHAP is based on Shapley values [335], a method of allocating rewards to players in cooperative games, where a team of individual players jointly contributes to a potential success. Shapley values solve this allocation problem under the requirements of additivity and fairness. This concept can be translated to analyzing how the individual feature components of $\boldsymbol{x}$ contribute to the total prediction $\mu(\boldsymbol{x})$ of a given case. Shapley values allow one to do such a contribution analysis in the aforementioned additive and fair way, see Lundberg–Lee [261]. The calculation of SHAP values is combinatorially demanding, and therefore several approximations have been proposed, many of them having their own caveats; we refer to Aas et al. [1]. We will not further consider these but refer to the relevant literature.

## *7.6.5 Marginal Attribution by Conditioning on Quantiles*

The above model-agnostic tools have mainly been studying the sensitivities of the expected response $\mu(\boldsymbol{x})$ in the feature components of $\boldsymbol{x}$. This becomes apparent from considering the partial derivatives (7.68) to calculate the local effects. Alternatively, we could try to understand how the feature components of $\boldsymbol{x}$ contribute to a given response $\mu(\boldsymbol{x})$, see Ancona et al. [12]; this section follows Merz et al. [273]. The marginal attribution of an input component $j$ to the response $\mu(\boldsymbol{x})$ can be studied by the directional derivative

$$
x_j \mapsto x_j \mu_j(\boldsymbol{x}) = x_j \frac{\partial \mu(\boldsymbol{x})}{\partial x_j}. \tag{7.74}
$$

This was first proposed to the data science community by Shrikumar et al. [340]. Basically, it means that we replace the partial derivative $\mu_j(\boldsymbol{x})$ by the directional derivative along the vector $x_j \boldsymbol{e}_j = (0, \dots, 0, x_j, 0, \dots, 0)^\top \in \mathbb{R}^{q+1}$

$$\begin{split} & \lim_{\epsilon \to 0} \frac{\mu(\boldsymbol{x} + \epsilon x_j \boldsymbol{e}_j) - \mu(\boldsymbol{x})}{\epsilon} \\ & = \lim_{\epsilon \to 0} \frac{\mu\left((1, x_1, \dots, x_{j-1}, (1+\epsilon) x_j, x_{j+1}, \dots, x_q)^\top\right) - \mu(\boldsymbol{x})}{\epsilon} = x_j \mu_j(\boldsymbol{x}), \end{split}$$

where $\boldsymbol{e}_j$ is the $(j+1)$-st basis vector in $\mathbb{R}^{q+1}$ (index $j = 0$ corresponds to the intercept component $x_0 = 1$).

We start by recalling the sensitivity analysis of Hong [189] and Tsanakas– Millossovich [355] in the context of risk measurement. Assume the features have a portfolio distribution *X* ∼ *p*. This describes the random selection of an insurance policy *X* = *x* from the portfolio described by *p*. The average price over the entire portfolio is then given by

$$
\bar{\mu} = \mathbb{E}_p[\mu(\boldsymbol{X})] = \int \mu(\boldsymbol{x}) \, dp(\boldsymbol{x}).
$$

We implicitly interpret $\mu(\boldsymbol{X}) = \mathbb{E}[Y | \boldsymbol{X}]$ as the price of the response $Y$; though, we do not need the response distribution in this section. Assume $\mu(\boldsymbol{X})$ has a continuous distribution function $F_{\mu(\boldsymbol{X})}$; and we drop the intercept component $X_0 = x_0 = 1$ from these considerations (but we still keep it in the regression model). This implies that $U_{\mu(\boldsymbol{X})} = F_{\mu(\boldsymbol{X})}(\mu(\boldsymbol{X}))$ is uniformly distributed on $[0,1]$. Choosing a density $\zeta$ on $[0,1]$ gives us a probability distortion $\zeta(U_{\mu(\boldsymbol{X})})$ as we have the normalization

$$\mathbb{E}\_p\left[\zeta(U\_{\mu(X)})\right] = \int\_0^1 \zeta(u) du = 1.$$

This allows us to define a distorted portfolio price in the sense of a Radon–Nikodým derivative, namely, we set for the distorted portfolio price

$$\varrho(\mu(\boldsymbol{X}); \zeta) = \mathbb{E}_p\left[\mu(\boldsymbol{X})\, \zeta(U_{\mu(\boldsymbol{X})})\right].$$

This functional $\varrho(\mu(\boldsymbol{X}); \zeta)$ is a so-called distortion risk measure. Our goal is to study the sensitivities of this distortion risk measure in the components of $\boldsymbol{X}$. Assume existence of the following directional derivatives for all $1 \le j \le q$

$$S_j(\mu; \zeta) = \frac{\partial}{\partial \epsilon} \, \varrho\left( \mu\left( (1, X_1, \dots, X_{j-1}, (1+\epsilon) X_j, X_{j+1}, \dots, X_q)^\top \right); \zeta \right) \bigg|_{\epsilon = 0}.$$

$S_j(\mu; \zeta)$ can be used to describe the sensitivities of the regression function $\boldsymbol{X} \mapsto \mu(\boldsymbol{X})$ in the feature components $X_j$. Under different sets of assumptions, Hong [189] and Tsanakas–Millossovich [355] have proved the following identity

$$S\_j(\mu;\zeta) = \mathbb{E}\_p\left[X\_j \mu\_j(X)\zeta(U\_{\mu(X)})\right],$$

the right-hand side exactly uses the marginal attribution (7.74). There remains the freedom of the choice of the density $\zeta$ on $[0,1]$, which allows us to study the sensitivities of different distortion risk measures. For the uniform density $\zeta \equiv 1$ on $[0,1]$ we simply obtain the average (best-estimate) price and its average marginal attributions

$$\varrho(\mu(\boldsymbol{X}); \zeta \equiv 1) = \mathbb{E}_p[\mu(\boldsymbol{X})] = \bar{\mu} \qquad \text{and} \qquad S_j(\mu; \zeta \equiv 1) = \mathbb{E}_p\left[X_j \mu_j(\boldsymbol{X})\right].$$
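For the uniform density these average marginal attributions can be computed by plain Monte Carlo once the gradient of $\mu$ is available. The log-linear `mu` below is a hypothetical stand-in with closed-form gradient, on a simulated standard normal portfolio:

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 100_000, 3
X = rng.normal(size=(n, q))                    # simulated portfolio distribution p
beta = np.array([0.3, -0.2, 0.0])              # hypothetical toy coefficients

mu = np.exp(X @ beta)                          # toy price mu(X) = exp(<beta, X>)
mu_grad = mu[:, None] * beta                   # closed-form gradient rows mu_j(X)

# average marginal attributions S_j(mu; zeta = 1) = E_p[X_j mu_j(X)]
S = (X * mu_grad).mean(axis=0)
```

For this toy model Stein's lemma gives the exact values $S_j = \beta_j^2 e^{\|\boldsymbol{\beta}\|^2/2}$, which the Monte Carlo estimate reproduces; in particular, a component with $\beta_j = 0$ receives zero attribution.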

If we want to consider a quantile risk measure, called value-at-risk (VaR), we choose a Dirac measure for the density $\zeta$. That is, we choose a point measure of mass 1 in $\alpha \in (0,1)$, i.e., the density $\zeta$ is concentrated in the single point $\alpha$. In that case, the event $\{F_{\mu(\boldsymbol{X})}(\mu(\boldsymbol{X})) = U_{\mu(\boldsymbol{X})} = \alpha\}$ receives probability one, and therefore we have the $\alpha$-quantile

$$
\varrho(\mu(X); \alpha) = F\_{\mu(X)}^{-1}(\alpha),
$$

and the corresponding sensitivities for 1 ≤ *j* ≤ *q*

$$S\_j(\mu;\alpha) = \mathbb{E}\_p\left[X\_j \mu\_j(\mathbf{X}) \, \middle| \, \mu(\mathbf{X}) = F^{-1}\_{\mu(\mathbf{X})}(\alpha)\right].\tag{7.75}$$

#### *Remarks 7.36*


• The conditional expectation in (7.75) needs to be evaluated empirically under the conditional probability, conditioned on the event $\{\mu(\boldsymbol{X}) = F^{-1}_{\mu(\boldsymbol{X})}(\alpha)\}$. This is done with a local smoother similarly to Listing 7.8.

In analogy to Merz et al. [273] we give a different interpretation to the sensitivities (7.75), which allows us to further expand this formula. We have the 1st order Taylor expansion

$$
\mu(\boldsymbol{x} + \boldsymbol{\epsilon}) = \mu(\boldsymbol{x}) + \left(\nabla_{\boldsymbol{x}}\mu(\boldsymbol{x})\right)^\top \boldsymbol{\epsilon} + o\left(\|\boldsymbol{\epsilon}\|_2\right) \qquad \text{for } \|\boldsymbol{\epsilon}\|_2 \to 0.
$$

Obviously, this is a local approximation in $\boldsymbol{x}$. Setting $\boldsymbol{\epsilon} = -\boldsymbol{x}$, we get the (possibly crude) approximation

$$
\mu(\mathbf{0}) \approx \mu\left(\mathbf{x}\right) - \left(\nabla\_{\mathbf{x}}\mu(\mathbf{x})\right)^{\top}\mathbf{x}.
$$

By bringing the gradient term to the other side, using (7.75) and conditionally averaging, we receive the 1st order marginal attributions

$$F^{-1}\_{\mu(X)}(\alpha) = \mathbb{E}\_p\left[\mu\left(X\right)\left|\mu(X) = F^{-1}\_{\mu(X)}(\alpha)\right.\right] \approx \mu\left(\mathbf{0}\right) + \sum\_{j=1}^q S\_j\left(\mu;\alpha\right). \tag{7.76}$$

Thus, the sensitivities $S_j(\mu; \alpha)$ provide a 1st order description of the quantiles $F^{-1}_{\mu(\boldsymbol{X})}(\alpha)$ of $\mu(\boldsymbol{X})$. We call this approach marginal attribution by conditioning on quantiles (MACQ) because it shows how the components $X_j$ of $\boldsymbol{X}$ contribute to a given quantile level.

*Example 7.37 (MACQ for Linear Regression)* The simplest case is the linear regression case because the 1st order marginal attributions (7.76) are exact in this case. Consider a linear regression function with regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$

$$\boldsymbol{x} \mapsto \mu(\boldsymbol{x}) = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = \beta_0 + \sum_{j=1}^{q} \beta_j x_j.$$

The 1st order marginal attributions for fixed *α* ∈ *(*0*,* 1*)* are given by

$$\begin{aligned} F\_{\mu(X)}^{-1}(\alpha) &= \mu \left( \mathbf{0} \right) + \sum\_{j=1}^{q} S\_j(\mu; \alpha) \\ &= \beta\_0 + \sum\_{j=1}^{q} \beta\_j \mathbb{E}\_p \left[ X\_j \, \middle| \, \mu(X) = F\_{\mu(X)}^{-1}(\alpha) \right]. \end{aligned} \tag{7.77}$$

That is, in (7.77) we replace the feature components $X_j$ by their expected contributions on a given quantile level $F^{-1}_{\mu(\boldsymbol{X})}(\alpha)$. We compare this explanation to the ALE profile (7.70). Setting the initial value $x_{j_0} = 0$, the ALE profile for the linear regression model is given by

$$x_j \mapsto \int_0^{x_j} \Delta_j(z_j; \mu) \, dz_j = \beta_j x_j.$$

This is the sensitivity of the linear regression function in component $x_j$, whereas (7.77) describes the contribution of each feature component to an expected response level $\mu(\boldsymbol{x})$; in particular, $\mathbb{E}_p[X_j \,|\, \mu(\boldsymbol{X}) = F^{-1}_{\mu(\boldsymbol{X})}(\alpha)]$ describes the average feature value in component $j$ on a given quantile level. $\blacksquare$
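The exactness of (7.77) can be verified numerically; the coefficients and portfolio below are hypothetical, and the conditional expectation is approximated by a narrow band around the empirical quantile instead of a local smoother:

```python
import numpy as np

rng = np.random.default_rng(5)
n, q = 50_000, 3
X = rng.normal(size=(n, q))
beta0, beta = -1.0, np.array([0.5, -0.3, 0.2])   # hypothetical regression parameter
mu = beta0 + X @ beta                             # linear regression prices mu(x_i)

alpha = 0.8
qa = np.quantile(mu, alpha)                       # empirical quantile F^{-1}_{mu(X)}(alpha)
band = np.abs(mu - qa) < 0.02                     # crude smoother around the conditioning event
S = beta * X[band].mean(axis=0)                   # S_j(mu; alpha) = beta_j E[X_j | mu(X) = qa]
recon = beta0 + S.sum()                           # right-hand side of (7.77)
```

Up to the bandwidth of the crude smoother, `recon` recovers the quantile `qa` exactly, because conditionally on $\mu(\boldsymbol{X}) = F^{-1}_{\mu(\boldsymbol{X})}(\alpha)$ the linear predictor is constant.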

A natural next step is to expand the 1st order attributions to the 2nd order. This allows us to consider the interaction terms. Consider the 2nd order Taylor expansion

$$
\mu(\boldsymbol{x} + \boldsymbol{\epsilon}) = \mu(\boldsymbol{x}) + \left(\nabla_{\boldsymbol{x}}\mu(\boldsymbol{x})\right)^\top \boldsymbol{\epsilon} + \frac{1}{2}\, \boldsymbol{\epsilon}^\top \nabla^2_{\boldsymbol{x}}\mu(\boldsymbol{x})\, \boldsymbol{\epsilon} + o(\|\boldsymbol{\epsilon}\|_2^2) \qquad \text{for } \|\boldsymbol{\epsilon}\|_2 \to 0.
$$

Similarly to (7.76), setting $\boldsymbol{\epsilon} = -\boldsymbol{x}$, this gives us the 2nd order marginal attributions

$$F\_{\mu(\mathbf{X})}^{-1}(\alpha) \approx \mu\left(\mathbf{0}\right) + \sum\_{j=1}^{q} S\_j(\mu; \alpha) - \frac{1}{2} \sum\_{j,k=1}^{q} T\_{j,k}(\mu; \alpha) \tag{7.78}$$

$$= \mu\left(\mathbf{0}\right) + \sum\_{j=1}^{q} \left(S\_j(\mu; \alpha) - \frac{1}{2} T\_{j,j}\left(\mu; \alpha\right)\right) - \sum\_{1 \le j < k \le q} T\_{j,k}(\mu; \alpha),$$

where for $1 \le j,k \le q$ we define $\mu_{j,k}(\boldsymbol{x}) = \partial^2 \mu(\boldsymbol{x})/\partial x_j \partial x_k$, see (7.72), and

$$T_{j,k}(\mu; \alpha) = \mathbb{E}_p\left[X_j X_k \, \mu_{j,k}(\boldsymbol{X}) \, \middle| \, \mu(\boldsymbol{X}) = F^{-1}_{\mu(\boldsymbol{X})}(\alpha)\right]. \tag{7.79}$$
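For a quadratic regression function the 2nd order expansion (7.78) is exact pointwise, which gives a quick correctness check of the attribution terms; all quantities below are hypothetical toy inputs:

```python
import numpy as np

rng = np.random.default_rng(6)
q = 3
b = rng.normal(size=q)
A = rng.normal(size=(q, q)); A = (A + A.T) / 2        # symmetric Hessian
c = 0.5

mu = lambda x: c + b @ x + 0.5 * x @ A @ x            # quadratic toy regression function
grad = lambda x: b + A @ x                            # gradient; Hessian is the constant A

x = rng.normal(size=q)
S = x * grad(x)                                       # pointwise analogs of S_j(mu; alpha)
T = np.outer(x, x) * A                                # and of T_{j,k}(mu; alpha)
lhs = mu(x)
rhs = mu(np.zeros(q)) + S.sum() - 0.5 * T.sum()       # expansion (7.78), exact for quadratics
```

Indeed, $\mu(\boldsymbol{0}) + \sum_j x_j \mu_j(\boldsymbol{x}) - \frac{1}{2}\sum_{j,k} x_j x_k \mu_{j,k}(\boldsymbol{x}) = c + \boldsymbol{b}^\top\boldsymbol{x} + \frac{1}{2}\boldsymbol{x}^\top A\boldsymbol{x} = \mu(\boldsymbol{x})$, so `lhs` and `rhs` agree up to rounding.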

#### *Remarks 7.38*


• Interestingly, we can precisely evaluate the accuracy of approximation (7.78) by analyzing for a given regression function *μ(*·*)*

$$\sup\_{\alpha \in (0,1)} \left| F^{-1}\_{\mu(X)}(\alpha) - \mu \left( \mathbf{0} \right) - \sum\_{j=1}^{q} S\_j \left( \mu; \alpha \right) + \frac{1}{2} \sum\_{j,k=1}^{q} T\_{j,k} \left( \mu; \alpha \right) \right|. \tag{7.80}$$

Intuitively, in order to have a uniformly good approximation, the origin $\boldsymbol{0}$ should be somehow centered in the feature distribution $\boldsymbol{X} \sim p$. This will be studied next.

Above we have implicitly assumed that $\boldsymbol{0}$ is a suitable reference point that makes the approximation error (7.80) small. For FN network fitting we typically normalize the features either using the MinMaxScaler (7.29), or we center and normalize the components of $(\boldsymbol{x}_i)_{1 \le i \le n}$ according to (7.30). That is, the reference point is chosen such that the gradient descent fitting works efficiently. However, this may not be an optimal reference point for studying the 2nd order attributions. Therefore, we analyze this question in more detail; the following reparametrization can still be done after model fitting.

If we choose an arbitrary translation $\boldsymbol{a} \in \mathbb{R}^q$, we can set $\boldsymbol{\epsilon} = \boldsymbol{a} - \boldsymbol{x}$ in the above 2nd order Taylor expansion to receive another 2nd order marginal attribution representation

$$\begin{split} F^{-1}_{\mu(\boldsymbol{X})}(\alpha) \approx \mu(\boldsymbol{a}) & - \mathbb{E}_p\left[ (\boldsymbol{a} - \boldsymbol{X})^\top \nabla_{\boldsymbol{x}}\mu(\boldsymbol{X}) \, \middle| \, \mu(\boldsymbol{X}) = F^{-1}_{\mu(\boldsymbol{X})}(\alpha) \right] \\ & - \frac{1}{2}\, \mathbb{E}_p\left[ (\boldsymbol{a} - \boldsymbol{X})^\top \nabla^2_{\boldsymbol{x}}\mu(\boldsymbol{X})\, (\boldsymbol{a} - \boldsymbol{X}) \, \middle| \, \mu(\boldsymbol{X}) = F^{-1}_{\mu(\boldsymbol{X})}(\alpha) \right]. \end{split} \tag{7.81}$$

Essentially, this means that we shift the feature distribution $p$ by considering the shifted random vectors $\boldsymbol{X}^{\boldsymbol{a}} = \boldsymbol{X} - \boldsymbol{a}$ while setting $\mu_{\boldsymbol{a}}(\cdot) = \mu(\boldsymbol{a} + \cdot)$; thus, this simply says that we pre-process the features differently. In view of approximation (7.81) we can now select a reference point $\boldsymbol{a} \in \mathbb{R}^q$ that makes the 2nd order marginal attributions as precise as possible. Define the events $\mathcal{A}_l = \{\mu(\boldsymbol{X}) = F^{-1}_{\mu(\boldsymbol{X})}(\alpha_l)\}$ for a discrete quantile grid $0 < \alpha_1 < \dots < \alpha_L < 1$. We define the objective function

$$\begin{split} \boldsymbol{a} \mapsto G(\boldsymbol{a}; \mu) = \sum_{l=1}^{L} \Big( F^{-1}_{\mu(\boldsymbol{X})}(\alpha_l) & - \mu(\boldsymbol{a}) + \mathbb{E}_p\left[ (\boldsymbol{a} - \boldsymbol{X})^\top \nabla_{\boldsymbol{x}}\mu(\boldsymbol{X}) \, \middle| \, \mathcal{A}_l \right] \\ & + \frac{1}{2}\, \mathbb{E}_p\left[ (\boldsymbol{a} - \boldsymbol{X})^\top \nabla^2_{\boldsymbol{x}}\mu(\boldsymbol{X})\, (\boldsymbol{a} - \boldsymbol{X}) \, \middle| \, \mathcal{A}_l \right] \Big)^2. \end{split} \tag{7.82}$$

Making this objective function *G(a*; *μ)* small in *a* will provide us with a good reference point for the selected quantile levels *(αl)*<sup>1</sup>≤*l*≤*L*; this is exactly the MACQ proposal of Merz et al. [273]. A local minimum can be found by applying a gradient descent algorithm

$$\mathfrak{a}^{(t)} \mapsto \mathfrak{a}^{(t+1)} = \mathfrak{a}^{(t)} - \delta\_{t+1} \nabla\_{\mathfrak{a}} G(\mathfrak{a}^{(t)}; \mu),$$

for tempered learning rates $\delta_{t+1} > 0$. The gradient of $G$ w.r.t. $\boldsymbol{a}$ is given by

$$\begin{split} \nabla_{\boldsymbol{a}} G(\boldsymbol{a}; \mu) &= 2 \sum_{l=1}^{L} \Big( F^{-1}_{\mu(\boldsymbol{X})}(\alpha_l) - \mu(\boldsymbol{a}) + \mathbb{E}_p\left[ (\boldsymbol{a} - \boldsymbol{X})^\top \nabla_{\boldsymbol{x}}\mu(\boldsymbol{X}) \, \middle| \, \mathcal{A}_l \right] \\ & \qquad\qquad + \frac{1}{2}\, \mathbb{E}_p\left[ (\boldsymbol{a} - \boldsymbol{X})^\top \nabla^2_{\boldsymbol{x}}\mu(\boldsymbol{X})\, (\boldsymbol{a} - \boldsymbol{X}) \, \middle| \, \mathcal{A}_l \right] \Big) \\ & \quad \times \Big( - \nabla_{\boldsymbol{a}}\mu(\boldsymbol{a}) + \mathbb{E}_p\left[ \nabla_{\boldsymbol{x}}\mu(\boldsymbol{X}) \, \middle| \, \mathcal{A}_l \right] + \mathbb{E}_p\left[ \nabla^2_{\boldsymbol{x}}\mu(\boldsymbol{X})\, (\boldsymbol{a} - \boldsymbol{X}) \, \middle| \, \mathcal{A}_l \right] \Big). \end{split}$$

All subsequent considerations and interpretations are done w.r.t. an optimal reference point $\boldsymbol{a} \in \mathbb{R}^q$ obtained by minimizing the objective function (7.82) on the chosen quantile grid. Mathematically speaking, this optimal choice is w.l.o.g. because the origin $\boldsymbol{0}$ of the coordinate system of the feature space $\mathcal{X}$ is arbitrary, and any other origin can be chosen by a translation, see formula (7.81) and the subsequent discussion. For interpretations, however, the choice of the reference point $\boldsymbol{a}$ matters because the directional derivative $X_j \mu_j(\boldsymbol{X})$ can be small either because $X_j$ is small or because $\mu_j(\boldsymbol{X})$ is small. Having a small $X_j$ means that this feature value is close to the chosen reference point.
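The reference-point search can be sketched end-to-end on a toy model. Below, $\mu$ is a hypothetical log-linear function with closed-form gradient and Hessian, the events $\mathcal{A}_l$ are approximated by narrow quantile bands, and the gradient of $G$ is taken numerically with a backtracking step size in place of the tempered learning rates:

```python
import numpy as np

rng = np.random.default_rng(7)
n, q = 20_000, 2
X = rng.normal(size=(n, q))
beta = np.array([0.4, -0.3])                      # hypothetical toy model mu(x) = exp(<beta, x>)
muX = np.exp(X @ beta)
gradX = muX[:, None] * beta                       # gradient rows; Hessian is mu(x) beta beta^T

alphas = np.linspace(0.1, 0.9, 9)                 # quantile grid (alpha_l)
qs = np.quantile(muX, alphas)
bands = [np.abs(muX - qa) < 0.05 for qa in qs]    # crude smoothing of the events A_l

def G(a):
    """Empirical objective (7.82): squared 2nd order expansion errors over the grid."""
    val = 0.0
    for qa, m in zip(qs, bands):
        d = a - X[m]                                          # (a - X) on the event A_l
        lin = (d * gradX[m]).sum(axis=1)                      # (a - X)^T grad mu(X)
        quad = 0.5 * muX[m] * (d @ beta) ** 2                 # 1/2 (a - X)^T Hessian (a - X)
        val += (qa - np.exp(a @ beta) + (lin + quad).mean()) ** 2
    return val

a, lr = np.ones(q), 0.5                           # arbitrary starting reference point
for _ in range(50):
    g = np.array([(G(a + 1e-5 * np.eye(q)[j]) - G(a - 1e-5 * np.eye(q)[j])) / 2e-5
                  for j in range(q)])
    while G(a - lr * g) > G(a) and lr > 1e-12:    # backtracking: only accept descent steps
        lr /= 2
    a = a - lr * g                                # final a: an improved reference point
```

The resulting `a` makes the 2nd order marginal attributions fit the quantiles of `muX` much more closely than the arbitrary starting point does.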

*Example 7.39 (MACQ Analysis)* We revisit the MTPL claim frequency example using the FN network regression model of depth $d = 3$ having $(q_1, q_2, q_3) = (20, 15, 10)$ neurons. Importantly, we use the hyperbolic tangent as the activation function in the FN layers, which provides smoothness of the regression function. Figure 7.40 shows the VPI plot of this fitted model. Obviously, the variable BonusMalus plays the most important role in this predictive model. Remark that the VPI plot does not properly respect the dependence structure in the features as it independently permutes one feature component at a time. The aim in this example is to determine variable importance by doing the MACQ analysis (7.78).

Figure 7.44 (lhs) shows the empirical density of the fitted canonical parameter $\theta(\boldsymbol{x}_i)$, $1 \le i \le n$; all plots in this example refer to the canonical scale. We then minimize the objective function (7.82), which provides us with an optimal reference point $\boldsymbol{a} \in \mathbb{R}^q$; we choose the equidistant quantile grid $1\% < 2\% < \dots < 99\%$, and all conditional expectations in $\nabla_{\boldsymbol{a}} G(\boldsymbol{a}; \mu)$ are empirically approximated by a local smoother similar to Listing 7.8. Figure 7.44 (rhs) gives the resulting marginal attributions w.r.t. this reference point. The orange line shows the 1st order marginal

**Fig. 7.44** (lhs) Empirical density of the fitted canonical parameter $\theta(\boldsymbol{x}_i)$, $1 \le i \le n$; (rhs) 1st and 2nd order marginal attributions

**Fig. 7.45** (lhs) Second order marginal attributions $S_j(\mu;\alpha) - \frac{1}{2}T_{j,j}(\mu;\alpha)$ excluding interaction terms, and (rhs) interaction terms $-\frac{1}{2}T_{j,k}(\mu;\alpha)$, $j \neq k$

attributions (7.76), and the red line the 2nd order marginal attributions (7.78). The cyan line drops the interaction terms $T_{j,k}(\mu;\alpha)$, $j \neq k$, from the 2nd order marginal attributions. From the shaded cyan area we see the importance of the interaction terms. We note that the 2nd order marginal attributions (red line) match the true empirical quantiles (black dots) quite well for the chosen reference point $\boldsymbol{a}$.

Figure 7.45 gives the 2nd order marginal attributions $S_j(\mu;\alpha) - \frac{1}{2}T_{j,j}(\mu;\alpha)$ of the individual components $1 \le j \le q$ on the left-hand side, and the interaction terms $-\frac{1}{2}T_{j,k}(\mu;\alpha)$, $j \neq k$, on the right-hand side. We identify the following components as being important: BonusMalus, DrivAge, VehGas, VehBrand and Region; these components show a behavior substantially different from being equal to 0, i.e.,

**Fig. 7.46** (lhs) Second order marginal attributions $S_j(\mu;\alpha) - \frac{1}{2}\sum_{k=1}^q T_{j,k}(\mu;\alpha)$ including interaction terms, and (rhs) slices at the quantile levels $\alpha \in \{20\%, 40\%, 60\%, 80\%\}$

these components differentiate from the reference point *a*. These components also have major interactions that contribute to the quantiles above the level 80%.

If we allocate the interaction terms to the corresponding components $1 \le j \le q$ we receive the second order marginal attributions $S_j(\mu;\alpha) - \frac{1}{2}\sum_{k=1}^q T_{j,k}(\mu;\alpha)$. These are illustrated in Fig. 7.46 (lhs), and the quantile slices at the levels $\alpha \in \{20\%, 40\%, 60\%, 80\%\}$ are given in Fig. 7.46 (rhs). These graphs illustrate variable importance on different quantile levels (respecting the dependence within the features). In particular, we identify the main variables that distinguish the given quantile levels from the reference level $\theta(\boldsymbol{a})$, i.e., Fig. 7.46 (rhs) should be understood as relative differences to the chosen reference level. Once more we see that BonusMalus is the main driver, but also other variables contribute to the differentiation of the high quantile levels.

Figure 7.47 shows the individual attributions $x_{i,j}\mu_j(\boldsymbol{x}_i)$ of 1'000 randomly selected cases $\boldsymbol{x}_i$ for the feature components $j = {}$BonusMalus, DrivAge, VehGas, VehBrand; the colors illustrate the corresponding feature values $x_{i,j}$ of the individual car drivers $i$, and the black solid line corresponds to $S_j(\mu;\alpha) - \frac{1}{2}T_{j,j}(\mu;\alpha)$ excluding the interaction terms (the black dotted line is one empirical standard deviation around the black solid line). Focusing on the variable BonusMalus we observe that the lower quantiles are almost completely dominated by insurance policies on the lowest bonus-malus level. The bonus-malus levels 70–80 provide little sensitivity (they are concentrated around the zero line) because the reference point $\boldsymbol{a}$ reflects these bonus-malus levels, and, finally, the large quantiles are dominated by high bonus-malus levels (red dots).

The plot of the variable DrivAge is interpreted similarly. The reference point $a$ is close to the young drivers; therefore, young drivers are concentrated around the zero line. At the low quantile levels, higher ages contribute positively to the low expected frequencies, whereas these ages have an unfavorable impact at higher

**Fig. 7.47** Individual attributions $x\_{i,j}\mu\_j(x\_i)$ of 1'000 randomly selected cases $x\_i$ for $j =$ BonusMalus, DrivAge, VehGas, VehBrand; the plots have different $y$-scales

quantile levels (this should be considered in combination with their bonus-malus levels). We also observe a few outliers in this plot; for instance, we can identify a driver of age 20 at a quantile level of 20%. Further inspection of this driver raises some doubts whether the data is correct, since this driver is at a bonus-malus level of 68% (which should technically not be possible) and she/he has an exposure of 2 days. Surely, this insurance policy would need further investigation.

The plot of VehGas shows that the chosen reference level $\theta(a)$ is closer to Diesel fuel cars, as the red dots fluctuate less around the zero line; in different runs of the gradient descent algorithm (with different seeds) this order has changed (as it depends on the reference point $a$). We skip a detailed analysis of the variable VehBrand.

## **7.7 Lab: Analysis of the Fitted Networks**

In the previous section we have studied some model-agnostic tools that can be used for any (differentiable) regression model. In this section we give some network-specific plots. For simplicity, we choose one specific example, namely the FN network $\mu \overset{\text{def.}}{=} \mu\_{m=1}$ of Table 7.9. We start by analyzing the learned representations in the different FN layers; this links to our introduction in Sect. 7.1.

For any FN layer $1 \le m \le d$ we can study the learned representations $z^{(m:1)}(x)$. For Fig. 7.48 we select at random 1'000 insurance policies $x\_i$, and the dots show the activations of these insurance policies in neurons $j = 4$ ($x$-axis) and $j = 9$ ($y$-axis) in the corresponding FN layers. These neuron activations are in the interval $(-1, 1)$ because we work with the hyperbolic tangent activation function for $\phi$. The color scale shows the resulting estimated frequencies $\mu(x\_i)$ of the selected policies. We observe that the layers increasingly (in the depth of the network) separate the low frequency policies (light blue-green colors) from the high frequency policies (red color). This is a quite typical picture, though this sparsity in the 3rd FN layer does not occur for every neuron $1 \le j \le q\_d$.

In higher dimensional FN architectures it will be difficult to analyze the learned representations on each individual neuron, but at least one can try to understand the main effects learned. For this, on the one hand, we can focus on the important feature components, see, e.g., Sect. 7.6.1, and, on the other hand, we can try to study the main effects learned using a PCA in each FN layer, see Sect. 7.5.3. Figure 7.49 shows the singular values $\lambda\_1 \ge \lambda\_2 \ge \ldots \ge \lambda\_{q\_m} > 0$ in each of the three FN layers $1 \le m \le d = 3$; we center the neuron activations to mean zero before applying the SVD. These plots support the previously made statement that the layers are increasingly separating the high frequency from the low frequency policies. An elbow criterion tells us that in the first FN layer we have 8 important principal components (out of 20), in the second FN layer 3 (out of 15), and in the third FN layer 1 (out of 10). This is also reflected in Fig. 7.48 where we see more and more

**Fig. 7.48** Observed activations in the three FN layers $m = 1, 2, 3$ (left-middle-right) in the corresponding neurons $j = 4, 9$; the color key shows the estimated frequencies $\mu(x\_i)$

**Fig. 7.49** Singular values $\lambda\_1 \ge \lambda\_2 \ge \ldots \ge \lambda\_{q\_m} > 0$ in the FN layers $1 \le m \le d = 3$

concentration in the neuron activations. It is important to notice that the chosen FN network calibration $\mu$ does not involve any drop-out layers during the gradient descent fitting, see Sect. 7.4.1. Drop-out layers prevent individual neurons from overtraining to a specific task. Consequently, under drop-outs we receive a network calibration that is more equally balanced across all neurons, because if one neuron drops out, the composite of the remaining neurons needs to be able to take over the task of the dropped-out neuron. This leads to less sparsity and to singular values that are more similarly sized.
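The centering-plus-SVD step described above can be sketched in a few lines. The following toy illustration uses simulated activations (all data and dimensions are made up): one latent factor drives all neurons, mimicking the concentration seen in deeper FN layers, and the largest singular value $\lambda\_1$ of the centered activation matrix is extracted by power iteration rather than a full SVD:

```python
import math
import random

random.seed(1)
n, q = 200, 5

# Hypothetical stand-in for neuron activations z^(m:1)(x_i): one latent factor
# drives all q neurons, as after tanh layers that have concentrated their task.
Z = []
for _ in range(n):
    u = random.gauss(0.0, 1.0)
    Z.append([math.tanh(u + 0.3 * random.gauss(0.0, 1.0)) for _ in range(q)])

# Step 1: center each neuron (column) to mean zero before the SVD.
means = [sum(row[j] for row in Z) / n for j in range(q)]
A = [[row[j] - means[j] for j in range(q)] for row in Z]

# Step 2: largest singular value lambda_1 via power iteration on A^T A
# (lambda_1^2 is the largest eigenvalue of A^T A).
v = [1.0] * q
lam_sq = 0.0
for _ in range(200):
    Av = [sum(row[j] * v[j] for j in range(q)) for row in A]
    w = [sum(A[i][j] * Av[i] for i in range(n)) for j in range(q)]
    lam_sq = math.sqrt(sum(x * x for x in w))   # ||A^T A v|| -> lambda_1^2
    v = [x / lam_sq for x in w]

lambda1 = math.sqrt(lam_sq)
# Frobenius norm equals sqrt(sum of all squared singular values).
frob = math.sqrt(sum(x * x for row in A for x in row))
print(round(lambda1, 2), round(lambda1 / frob, 2))
```

A ratio $\lambda\_1/\|A\|\_F$ close to 1 corresponds to the one-dominant-component picture seen in the third FN layer above.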

In Fig. 7.50 we analyze the first two principal components in each FN layer, i.e., the two principal components that correspond to the two biggest singular values $(\lambda\_1, \lambda\_2)$ in each of the three FN layers. The first row shows the input variables (BonusMalus, DrivAge) $\in [50, 125] \times [18, 90]$ of the 1'000 randomly selected policies $x\_i$; these are the two most important feature components according to the VPI analysis. All three columns show the same data, however, in different color scales: (lhs) uses the color scale of $\mu$, (middle) the color scale of BonusMalus, and (rhs) the color scale of DrivAge. These color scales also apply to the other rows. The 2nd row shows the first two principal components in the 1st FN layer, the 3rd row those in the 2nd FN layer, and the last row those in the 3rd FN layer. Focusing on the first column, we observe that across the FN layers the 1st principal component clusters the high and the low frequency policies more and more. Not surprisingly, this leads to a quite clear-cut separation w.r.t. the bonus-malus level, which can be verified from the second column of Fig. 7.50. For the driver's age variable this sharp separation gets lost across the layers, see the third column of Fig. 7.50, which indicates that the variable DrivAge does not influence the frequency monotonically and that it interacts with the variable BonusMalus.

Figure 7.51 shows the second order marginal attributions (7.78) for the different inputs. The graph on the left-hand side shows the plot w.r.t. the original inputs $x\_i$, the graph in the middle w.r.t. the learned representations $z^{(1:1)}(x\_i) \in \mathbb{R}^{q\_1}$ in the first FN layer, and the one on the right-hand side w.r.t. the learned representations $z^{(2:1)}(x\_i) \in \mathbb{R}^{q\_2}$ in the second FN layer. We interpret these plots as follows: the FN network disentangles the different effects through the FN layers by making

**Fig. 7.50** (First row) Input variables (BonusMalus, DrivAge), (second–fourth row) first two principal components in FN layers $m = 1, 2, 3$; (lhs) gives the color scale of the estimated frequency $\mu$, (middle) gives the color scale of BonusMalus, and (rhs) gives the color scale of DrivAge

**Fig. 7.51** Second order marginal attributions: (lhs) w.r.t. the input layer $x \in \mathbb{R}^{q\_0}$, (middle) w.r.t. the first FN layer $z^{(1:1)}(x) \in \mathbb{R}^{q\_1}$, and (rhs) w.r.t. the second FN layer $z^{(2:1)}(x) \in \mathbb{R}^{q\_2}$

the plots smoother and the interactions between the neurons smaller. Note that the learned representations $z^{(3:1)}(x\_i) \in \mathbb{R}^{q\_3}$ in the last FN layer go into a classical GLM for the output layer, which does not have any interactions in the canonical predictor (because it is additive on the canonical scale), thus being of the same type as the linear regression of Example 7.37. In the Poisson model with the log-link function, the interactions can only be of a multiplicative type in GLMs. Therefore, the network feature-engineers the input $x\_i$ (in an automated way) such that the learned representation $z^{(d:1)}(x\_i)$ in the last FN layer is exactly in this GLM structure. This is verified by the small interaction part in Fig. 7.51 (rhs). This closes this part on model-agnostic tools.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 8 Recurrent Neural Networks**

Chapter 7 has discussed fully-connected *feed-forward neural* (FN) networks. Feed-forward means that information is passed in a directed acyclic path from the input layer to the output layer. A natural extension is to allow these networks to have cycles; in that case, we call the architecture a *recurrent neural* (RN) network. A RN network architecture is particularly useful for time-series modeling. The discussion of time-series data also links to Sect. 5.8.1 on longitudinal and panel data. RN networks were introduced in the 1980s, and the two most popular RN network architectures are the long short-term memory (LSTM) architecture proposed by Hochreiter–Schmidhuber [188] and the gated recurrent unit (GRU) architecture introduced by Cho et al. [76]. These two architectures will be described in detail in this chapter.

## **8.1 Motivation for Recurrent Neural Networks**

We start from a deep FN network providing the regression function, see (7.2)–(7.3),

$$\mathbf{x} \mapsto \mu(\mathbf{x}) = g^{-1} \left\langle \boldsymbol{\beta}, \mathbf{z}^{(d:1)}(\mathbf{x}) \right\rangle,\tag{8.1}$$

with a composition $z^{(d:1)}$ of $d$ FN layers $z^{(m)}$, $1 \le m \le d$, link function $g$ and with output parameter $\boldsymbol{\beta} \in \mathbb{R}^{q\_d+1}$. In principle, we could directly use this FN network architecture for time-series forecasting. We explain here why this is not the best option to deal with time-series data.

Assume we want to predict a random variable $Y\_{T+1}$ at time $T \ge 0$ based on the time-series information $x\_0, x\_1, \ldots, x\_T$. This information is assumed to be available at time $T$ for predicting the response $Y\_{T+1}$. The past response information $Y\_t$, $1 \le t \le T$, is typically included in $x\_t$.<sup>1</sup> Using the above FN network architecture we could directly try to predict $Y\_{T+1}$ based on this past information. Therefore, we define the feature information $x\_{0:T} = (x\_0, \ldots, x\_T)$ and we aim at designing a FN network (8.1) for modeling

$$\mathbf{x}\_{0:T} \mapsto \mu\_T(\mathbf{x}\_{0:T}) = \mathbb{E}[Y\_{T+1}|\mathbf{x}\_{0:T}] = \mathbb{E}[Y\_{T+1}|\mathbf{x}\_0, \dots, \mathbf{x}\_T].$$

In principle we could work with such an approach; however, it has a couple of severe drawbacks. Obviously, the length of the feature vector $x\_{0:T}$ depends on time $T$, that is, it will grow with every time step. Therefore, the regression function (network architecture) $x\_{0:T} \mapsto \mu\_T(x\_{0:T})$ is time-dependent. Consequently, with this approach we have to fit a network for every $T$. This deficiency can be circumvented if we assume a Markov property that does not require carrying forward the whole past history. Assume that it is sufficient to consider a history of a certain length. Choose $\tau \ge 0$ fixed; then, for $T \ge \tau$, we can set for the feature information $x\_{T-\tau:T} = (x\_{T-\tau}, \ldots, x\_T)$, which now has a fixed length $\tau + 1 \ge 1$. In this situation we could try to design a FN network

$$\mathbf{x}\_{T-\tau:T} \mapsto \mu(\mathbf{x}\_{T-\tau:T}) = \mathbb{E}[Y\_{T+1}|\mathbf{x}\_{T-\tau:T}] = \mathbb{E}[Y\_{T+1}|\mathbf{x}\_{T-\tau}, \dots, \mathbf{x}\_T].$$

This network regression function can be chosen independent of $T$ since the relevant history $x\_{T-\tau:T}$ always has the same length $\tau + 1$. The time variable $T$ could be used as a feature component in $x\_{T-\tau:T}$. The disadvantage of this approach is that such a FN network architecture does not respect the temporal causality. Observe that we feed the past history into the first FN layer

$$\mathbf{x}\_{T-\tau:T} \mapsto \mathbf{z}^{(1)}(\mathbf{x}\_{T-\tau:T}) \in \{1\} \times \mathbb{R}^{q\_1}.$$

This operation typically does not respect any topology in the time index of $x\_{T-\tau:T}$. Thus, the FN network does not recognize that the feature $x\_{t-1}$ has been experienced just before the next feature $x\_t$. For this reason we are looking for a network architecture that can handle the time-series information in a temporally causal way.
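The fixed-window construction $x\_{T-\tau:T}$ discussed above amounts to sliding a window of length $\tau + 1$ over the time series. A minimal sketch, with made-up data, of what a plain FN network would receive as its (flattened) input:

```python
def sliding_windows(xs, tau):
    """Fixed-length feature vectors x_{T-tau:T} of length tau + 1, for T >= tau.

    Each window is what a plain FN network would take as input; nothing in this
    representation tells the network that adjacent entries are temporal neighbors.
    """
    return [xs[T - tau:T + 1] for T in range(tau, len(xs))]

xs = [10, 12, 9, 14, 11]                 # toy scalar observations x_0, ..., x_4
print(sliding_windows(xs, tau=2))        # [[10, 12, 9], [12, 9, 14], [9, 14, 11]]
```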

<sup>1</sup> More mathematically speaking, we assume to have a filtration $(\mathcal{A}\_t)\_{t \ge 0}$ on the probability space $(\Omega, \mathcal{A}, \mathbb{P})$. The basic assumption then is that both sequences $(x\_t)\_t$ and $(Y\_t)\_t$ are $(\mathcal{A}\_t)\_t$-adapted, and we aim at predicting $Y\_{T+1}$ based on the information $\mathcal{A}\_T$. In the above case this information $\mathcal{A}\_T$ is generated by $x\_0, x\_1, \ldots, x\_T$, where $x\_t$ typically includes the observation $Y\_t$. We could also shift the time index in $x\_t$ by one time unit, and in that case we would assume that $(x\_t)\_t$ is previsible w.r.t. the filtration $(\mathcal{A}\_t)\_t$. We do not consider this shift in the time index as it only makes the notation unnecessarily more complicated; the results remain the same by including the information correspondingly into the features.

## **8.2 Plain-Vanilla Recurrent Neural Network**

## *8.2.1 Recurrent Neural Network Layer*

We explain the basic idea of RN networks in a shallow network architecture; deep network architectures will be discussed in Sect. 8.2.2, below. We start from the time-series input variable $x\_{0:T} = (x\_0, \ldots, x\_T)$, all components having the same structure $x\_t \in \mathcal{X} \subset \{1\} \times \mathbb{R}^{q\_0}$, $0 \le t \le T$. The aim is to design a network architecture that allows us to predict the random variable $Y\_{T+1}$ based on this time-series information $x\_{0:T}$.

The main idea is to feed one component $x\_t$ of the time-series $x\_{0:T}$ at a time into the network, and at the same time we use the output $z\_{t-1}$ of the previous loop as an input for the next loop. This variable $z\_{t-1}$ carries forward a memory of the past variables $x\_{0:t-1}$. We explain this with a single RN layer having $q\_1 \in \mathbb{N}$ neurons. A RN layer is given (recursively) by a mapping, $t \ge 1$,

$$\mathbf{z}^{(1)}: \{1\} \times \mathbb{R}^{q\_0} \times \mathbb{R}^{q\_1} \to \mathbb{R}^{q\_1}, \qquad (\mathbf{x}\_t, \mathbf{z}\_{t-1}) \mapsto \mathbf{z}\_t = \mathbf{z}^{(1)}\left(\mathbf{x}\_t, \mathbf{z}\_{t-1}\right),\tag{8.2}$$

where the RN layer $z^{(1)}$ has the same structure as the FN layer given in (7.5), but based on the feature input $(x\_t, z\_{t-1}) \in \mathcal{X} \times \mathbb{R}^{q\_1} \subset \{1\} \times \mathbb{R}^{q\_0} \times \mathbb{R}^{q\_1}$, and not including an intercept component $\{1\}$ in the output.

More formally, a *RN layer* with activation function *φ* is a mapping

$$\mathbf{z}^{(1)}: \{1\} \times \mathbb{R}^{q\_0} \times \mathbb{R}^{q\_1} \to \mathbb{R}^{q\_1}, \qquad (\mathbf{x}, \mathbf{z}) \mapsto \mathbf{z}^{(1)}(\mathbf{x}, \mathbf{z}) = \left(z\_1^{(1)}(\mathbf{x}, \mathbf{z}), \dots, z\_{q\_1}^{(1)}(\mathbf{x}, \mathbf{z})\right)^{\top},\tag{8.3}$$

having neurons, $1 \le j \le q\_1$,

$$z\_j^{(1)}(\mathbf{x}, \mathbf{z}) = \phi\left(\left<\mathbf{w}\_j^{(1)}, \mathbf{x}\right> + \left<\mathbf{u}\_j^{(1)}, \mathbf{z}\right>\right),\tag{8.4}$$

for given network weights $\mathbf{w}\_j^{(1)} \in \mathbb{R}^{q\_0+1}$ and $\mathbf{u}\_j^{(1)} \in \mathbb{R}^{q\_1}$.

Thus, the FN layers (7.5)–(7.6) and the RN layers (8.3)–(8.4) are structurally equivalent; only the input $x \in \mathcal{X}$ is adapted to the time-series structure $(x\_t, z\_{t-1}) \in \mathcal{X} \times \mathbb{R}^{q\_1}$. Before giving more interpretation and explaining how this single RN network structure can be extended to a deep RN network, we illustrate this RN layer.
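A minimal sketch of the RN layer recursion (8.2)–(8.4) in plain Python may help fix the mechanics; the weight values below are arbitrary placeholders, $\phi = \tanh$, and the intercept is carried as the leading component $x\_t[0] = 1$:

```python
import math

def rn_layer_step(x_t, z_prev, W, U):
    """One RN layer step (8.2)-(8.4) with phi = tanh:
    z_t[j] = tanh(<w_j, x_t> + <u_j, z_prev>).
    W has q0 + 1 rows (row 0 multiplies the intercept x_t[0] = 1), U has q1 rows."""
    q1 = len(U)
    return [math.tanh(sum(W[l][j] * x_t[l] for l in range(len(x_t)))
                      + sum(U[l][j] * z_prev[l] for l in range(q1)))
            for j in range(q1)]

# Toy sizes and arbitrary illustrative weights.
q0, q1 = 2, 3
W = [[0.1 * (l + j + 1) for j in range(q1)] for l in range(q0 + 1)]
U = [[0.05 if l == j else 0.0 for j in range(q1)] for l in range(q1)]

z = [0.0] * q1                                   # initialization z_0
xs = [[1.0, 0.5, -0.2], [1.0, 0.4, 0.1], [1.0, -0.3, 0.7]]
for x_t in xs:                                   # the same W, U are reused for every t
    z = rn_layer_step(x_t, z, W, U)              # z carries the memory of x_{0:t}
print([round(val, 4) for val in z])
```

Note that the loop reuses the same weights $W$ and $U$ for every $t$, which is exactly the weight sharing depicted in the unfolded representation below.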


**Fig. 8.1** RN layer *<sup>z</sup>(*1*)* processing the input *(xt, <sup>z</sup>t*−1*)*

**Fig. 8.2** Unfolded representation of RN layer *<sup>z</sup>(*1*)* processing the input *(xt, <sup>z</sup>t*−1*)*

Figure 8.1 shows an RN layer $z^{(1)}$ processing the input $(x\_t, z\_{t-1})$, see (8.2). From this graph the recurrent structure becomes clear, since we have a loop (cycle) feeding the output $z\_t$ back into the RN layer to process the next input $(x\_{t+1}, z\_t)$.

Often one depicts the RN architecture in a so-called unfolded way. This is done in Fig. 8.2. Instead of plotting the loop (cycle) as in Fig. 8.1 (orange arrow in the colored version), we unfold this loop by plotting the RN layer multiple times. Note that the RN layer in Fig. 8.2 always uses the same network weights $\mathbf{w}\_j^{(1)}$ and $\mathbf{u}\_j^{(1)}$, $1 \le j \le q\_1$, for all $t$. Moreover, the use of the colors of the arrows (in the colored version) in the two figures coincides.

#### *Remarks 8.1*

• The neurons of the RN layer (8.4) have the following structure

$$z\_j^{(1)}(\mathbf{x}, \mathbf{z}) = \phi\left(\langle \mathbf{w}\_j^{(1)}, \mathbf{x} \rangle + \langle \mathbf{u}\_j^{(1)}, \mathbf{z} \rangle\right) = \phi\left(w\_{0,j}^{(1)} + \sum\_{l=1}^{q\_0} w\_{l,j}^{(1)} x\_l + \sum\_{l=1}^{q\_1} u\_{l,j}^{(1)} z\_l\right).$$

The network weights $W^{(1)} = (\mathbf{w}\_j^{(1)})\_{1 \le j \le q\_1} \in \mathbb{R}^{(q\_0+1) \times q\_1}$ include an intercept component $w\_{0,j}^{(1)}$, and the network weights $U^{(1)} = (\mathbf{u}\_j^{(1)})\_{1 \le j \le q\_1} \in \mathbb{R}^{q\_1 \times q\_1}$ do not include an intercept component, otherwise we would have a redundancy.


## *8.2.2 Deep Recurrent Neural Network Architectures*

There are many different ways of extending a shallow RN network to a deep RN network. Assume we want to model a RN network of depth *d* ≥ 2. A first (obvious) way of receiving a deep RN network architecture is

$$\mathbf{z}\_{t}^{[1]} = \mathbf{z}^{(1)}\left(\mathbf{x}\_{t}, \mathbf{z}\_{t-1}^{[1]}\right) \quad \in \mathbb{R}^{q\_{1}},\tag{8.5}$$

$$\mathbf{z}\_{t}^{[m]} = \mathbf{z}^{(m)}\left(\mathbf{z}\_{t}^{[m-1]}, \mathbf{z}\_{t-1}^{[m]}\right) \quad \in \mathbb{R}^{q\_{m}} \qquad \text{for } 2 \le m \le d,\tag{8.6}$$

where all RN layers $z^{(m)}$, $1 \le m \le d$, are of type (8.3)–(8.4), and additionally we include an intercept component in the RN layers $z^{(m)}$, $2 \le m \le d$. We add the upper indices (in square brackets $[\cdot]$) to the time-series $(z\_t^{[m]})\_t$ to indicate which RN layer outputs these learned representations (memory processes). In fact, we could also write $z\_t^{[m:1]}$ instead of $z\_t^{[m]}$, because in $z\_t^{[m:1]}$ the feature input $x\_{0:t}$ has been processed through $m$ RN layers $z^{(1)}, \ldots, z^{(m)}$. For simplicity, we just use the notation $z\_t^{[m]} = z\_t^{[m]}(x\_{0:t})$.

We are going to use the following abbreviation for a RN layer *m* ≥ 1

$$\mathbf{z}\_{t}^{[m]} = \mathbf{z}^{(m)}\left(\mathbf{z}\_{t}^{[m-1]}, \mathbf{z}\_{t-1}^{[m]}\right) = \phi\left(\left\langle W^{(m)}, \mathbf{z}\_{t}^{[m-1]}\right\rangle + \left\langle U^{(m)}, \mathbf{z}\_{t-1}^{[m]}\right\rangle\right),\tag{8.7}$$

where the weights $W^{(m)} = (\mathbf{w}\_1^{(m)}, \ldots, \mathbf{w}\_{q\_m}^{(m)}) \in \mathbb{R}^{(q\_{m-1}+1) \times q\_m}$ include the intercept components, and the weights $U^{(m)} = (\mathbf{u}\_1^{(m)}, \ldots, \mathbf{u}\_{q\_m}^{(m)}) \in \mathbb{R}^{q\_m \times q\_m}$ do not include any intercept components. The scalar product is understood column-wise in the weight matrices $W^{(m)}$ and $U^{(m)}$, and the activation $\phi$ is applied component-wise. Moreover, we initialize the input $\mathbf{z}\_t^{[0]} = \mathbf{x}\_t$.
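The depth-$d$ recursion (8.5)–(8.7) can be sketched as two stacked RN layers; all weight values below are arbitrary placeholders, and an intercept component is prepended to $z\_t^{[1]}$ before it enters the second layer:

```python
import math

def rn_step(inp, z_prev, W, U):
    # phi(<W, inp> + <U, z_prev>) column-wise, cf. (8.7); inp carries an intercept 1.
    q = len(U)
    return [math.tanh(sum(W[l][j] * inp[l] for l in range(len(inp)))
                      + sum(U[l][j] * z_prev[l] for l in range(q)))
            for j in range(q)]

# Depth d = 2: layer 1 reads x_t, layer 2 reads [1] + z_t^[1] (intercept added).
q0, q1, q2 = 2, 3, 2
W1 = [[0.2] * q1 for _ in range(q0 + 1)]; U1 = [[0.1] * q1 for _ in range(q1)]
W2 = [[0.3] * q2 for _ in range(q1 + 1)]; U2 = [[0.1] * q2 for _ in range(q2)]

z1, z2 = [0.0] * q1, [0.0] * q2
for x_t in ([1.0, 0.5, -0.2], [1.0, 0.0, 0.8]):
    z1 = rn_step(x_t, z1, W1, U1)               # (8.5): first RN layer
    z2 = rn_step([1.0] + z1, z2, W2, U2)        # (8.6): second RN layer
print([round(val, 4) for val in z2])
```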

**Fig. 8.3** Unfolded representation of a RN network architecture of depth *d* = 2

Figure 8.3 shows the RN network architecture of depth $d = 2$ defined in (8.5)–(8.6). The dimension of the input $\mathbf{z}\_t^{[0]} = \mathbf{x}\_t \in \mathcal{X} \subseteq \{1\} \times \mathbb{R}^{q\_0}$ is $q\_0 + 1$, the first RN layer has $q\_1$ neurons and the second RN layer $q\_2$ neurons. From this graph it becomes clear how a RN network architecture of any depth $d \in \mathbb{N}$ can be constructed (recursively).

*Remark 8.2* There are many alternative ways in building deep RN networks. E.g., we can add a loop that connects the output of the second RN layer back to the first one

$$\begin{aligned} \mathbf{z}\_{t}^{[1]} &= \mathbf{z}^{(1)}\left(\mathbf{x}\_{t}, \mathbf{z}\_{t-1}^{[1]}, \mathbf{z}\_{t-1}^{[2]}\right), \\ \mathbf{z}\_{t}^{[2]} &= \mathbf{z}^{(2)}\left(\mathbf{z}\_{t}^{[1]}, \mathbf{z}\_{t-1}^{[2]}\right), \end{aligned}$$

or we can add a skip connection from the input variable *x<sup>t</sup>* to the second RN layer

$$\begin{aligned} \mathbf{z}\_{t}^{[1]} &= \mathbf{z}^{(1)}\left(\mathbf{x}\_{t}, \mathbf{z}\_{t-1}^{[1]}\right), \\ \mathbf{z}\_{t}^{[2]} &= \mathbf{z}^{(2)}\left(\mathbf{x}\_{t}, \mathbf{z}\_{t}^{[1]}, \mathbf{z}\_{t-1}^{[2]}\right). \end{aligned}$$


We refrain from explicitly studying such RN network variants any further.

## *8.2.3 Designing the Network Output*

It remains to explain how to predict the response variable $Y\_{T+1}$ based on the pre-processed features (memory processes) $\mathbf{z}\_T^{[1]}, \ldots, \mathbf{z}\_T^{[d]}$ outputted by the RN network of depth $d \ge 1$. Typically, only the final output of the last RN layer $\mathbf{z}\_T^{[d]} = \mathbf{z}\_T^{[d]}(\mathbf{x}\_{0:T}) \in \mathbb{R}^{q\_d}$ is considered to predict the response $Y\_{T+1}$. We take this output and feed it into a FN network $\bar{\mathbf{z}}^{(D:1)}: \{1\} \times \mathbb{R}^{q\_d} \to \{1\} \times \mathbb{R}^{\bar{q}\_D}$ of depth $D \in \mathbb{N}$ and with FN layers $\bar{\mathbf{z}}^{(m)}$, $1 \le m \le D$, given by (7.5). Moreover, we choose a strictly monotone and smooth link function $g$.

This then provides us with the regression function, see (7.7)–(7.8),

$$\mathbf{x}\_{0:T} \mapsto \mathbb{E}[Y\_{T+1}|\mathbf{x}\_{0:T}] = \mu(\mathbf{x}\_{0:T}) = g^{-1} \left\langle \boldsymbol{\beta}, \bar{\mathbf{z}}^{(D:1)} \left( \mathbf{z}\_T^{[d]}(\mathbf{x}\_{0:T}) \right) \right\rangle. \tag{8.8}$$

Thus, we first process the time-series features $\mathbf{x}\_{0:T}$ through a RN network to receive the learned representation $\mathbf{z}\_T^{[d]}(\mathbf{x}\_{0:T}) \in \mathbb{R}^{q\_d}$ at time $T$. This learned representation is then used as a feature input to a FN network $\bar{\mathbf{z}}^{(D:1)}$ that allows us to predict the response $Y\_{T+1}$. This is illustrated in Fig. 8.4 for depth $d = 1$.

#### *Remarks 8.3*


**Fig. 8.4** Forecasting the response $Y\_{T+1}$ using a RN network (8.8) based on a single RN layer $d = 1$ and on a FN network of depth $D$

It remains to fit this network architecture, having $d$ RN layers and $D$ FN layers, to the available data. The RN layers involve the network weights $W^{(m)} \in \mathbb{R}^{(q\_{m-1}+1) \times q\_m}$ and $U^{(m)} \in \mathbb{R}^{q\_m \times q\_m}$, for $1 \le m \le d$, and the FN layers involve the network weights $(\bar{\mathbf{w}}\_j^{(m)})\_{1 \le j \le \bar{q}\_m} \in \mathbb{R}^{(\bar{q}\_{m-1}+1) \times \bar{q}\_m}$, for $1 \le m \le D$, with $\bar{q}\_0 = q\_d$. Moreover, we have an output parameter $\boldsymbol{\beta} \in \mathbb{R}^{\bar{q}\_D+1}$. The fitting is again done by a gradient descent algorithm minimizing the corresponding objective function.
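Counting the weights listed above is simple bookkeeping; as a sketch (with made-up architecture sizes), a small helper adds $(q\_{m-1}+1)q\_m + q\_m^2$ per RN layer, $(\bar q\_{m-1}+1)\bar q\_m$ per FN layer, and $\bar q\_D + 1$ for the output parameter $\beta$:

```python
def n_params(q0, rnn_units, fnn_units):
    """Number of weights of the architecture in Sect. 8.2.3 (toy sizes):
    RN layers: W^(m) of shape (q_{m-1}+1) x q_m plus U^(m) of shape q_m x q_m;
    FN head:   (q_{m-1}+1) x q_m per layer; output beta has q_D + 1 components."""
    total, prev = 0, q0
    for q in rnn_units:
        total += (prev + 1) * q + q * q   # W^(m) and U^(m)
        prev = q
    for q in fnn_units:
        total += (prev + 1) * q           # FN layer weights (with intercept)
        prev = q
    return total + prev + 1               # output parameter beta

# e.g. q0 = 3, one RN layer with 10 neurons, one FN layer with 5 neurons:
print(n_params(3, [10], [5]))             # -> 201
```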

Assume we have independent (in $i$) data $(Y\_{i,T+1}, \mathbf{x}\_{i,0:T}, v\_{i,T+1})$ of the cases $1 \le i \le n$. We then assume that the responses $Y\_{i,T+1}$ can be modeled by a fixed member of the EDF having unit deviance $\mathfrak{d}$. We consider the deviance loss function, see (4.9),

$$\vartheta \mapsto \mathfrak{D}(\boldsymbol{Y}\_{T+1}, \vartheta) = \frac{1}{n} \sum\_{i=1}^{n} \frac{v\_{i,T+1}}{\varphi}\, \mathfrak{d}\left(Y\_{i,T+1}, \mu\_{\vartheta}(\mathbf{x}\_{i,0:T})\right), \tag{8.9}$$

for the observations $\boldsymbol{Y}\_{T+1} = (Y\_{1,T+1}, \ldots, Y\_{n,T+1})^{\top}$, and where $\vartheta$ collects all the RN and FN network weights/parameters of the regression function (8.8). This model can now be fitted using a variant of the gradient descent algorithm. The variant uses back-propagation through time (BPTT), which is an adaptation of the back-propagation method to calculate the gradient w.r.t. the network parameter $\vartheta$.

## *8.2.4 Time-Distributed Layer*

There is a special feature in RN network modeling called a *time-distributed layer*. Observe from Fig. 8.4 that the deviance loss function (8.9) only focuses on the final observation $Y\_{i,T+1}$. However, the stationarity assumption allows us to output and study any (previous) observation $Y\_{i,t+1}$, $0 \le t \le T$. A time-distributed layer considers applying the deep FN network (8.8) *simultaneously* at all time points $0 \le t \le T$; simultaneously meaning that we use the same FN network weights for all $t$. The latter is justified under the assumption of having stationarity.

This then provides us with the regressions

$$\mathbf{x}\_{0:t} \mapsto \mathbb{E}[Y\_{t+1}|\mathbf{x}\_{0:t}] = \mu(\mathbf{x}\_{0:t}) = g^{-1}\left\langle \boldsymbol{\beta}, \bar{\mathbf{z}}^{(D:1)}\left(\mathbf{z}\_{t}^{[d]}(\mathbf{x}\_{0:t})\right) \right\rangle \qquad \text{for all } t \ge 0. \tag{8.10}$$

Figure 8.5 illustrates a time-distributed output where we predict $(Y\_{t+1})\_t$ based on the history $(\mathbf{x}\_{0:t})\_t$, and we always apply the same FN network $\bar{\mathbf{z}}^{(D:1)}$ to the memory $\mathbf{z}\_t^{[1]} = \mathbf{z}\_t^{[1]}(\mathbf{x}\_{0:t})$.

A time-distributed layer changes the fitting procedure. Instead of considering the objective function (8.9) for the final observation $Y\_{i,T+1}$ only, we now include all observations $\boldsymbol{Y} = (Y\_{i,t+1})\_{0 \le t \le T, 1 \le i \le n}$ into the objective function. This results in studying the deviance loss function

$$\vartheta \mapsto \mathfrak{D}(\boldsymbol{Y}, \vartheta) = \frac{1}{n} \sum\_{i=1}^{n} \frac{1}{T+1} \sum\_{t=0}^{T} \frac{v\_{i,t+1}}{\varphi}\, \mathfrak{d}\left(Y\_{i,t+1}, \mu\_{\vartheta}(\mathbf{x}\_{i,0:t})\right). \tag{8.11}$$

**Fig. 8.5** Forecasting $(Y\_{t+1})\_t$ using a RN network (8.10) based on a single RN layer $d = 1$ and using a time-distributed FN layer for the outputs

Note that this can easily be adapted if the different cases 1 ≤ *i* ≤ *n* have different lengths in their histories. An example is provided in Listing 10.8, below.
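As a sketch, the time-distributed deviance loss (8.11) in the Poisson case (unit deviance $\mathfrak{d}(y, \mu) = 2(\mu - y - y\log(\mu/y))$, with the $y = 0$ limit) is a double average over cases and time points; the observations, predictions and exposures below are made-up toy values:

```python
import math

def poisson_unit_deviance(y, mu):
    # d(y, mu) = 2 * (mu - y + y * log(y / mu)), with the y = 0 limit handled.
    return 2.0 * (mu - y + (y * math.log(y / mu) if y > 0 else 0.0))

def time_distributed_deviance(Y, mu, v, phi=1.0):
    """Deviance loss (8.11): Y[i][t] plays Y_{i,t+1}, mu[i][t] plays
    mu_theta(x_{i,0:t}), v[i][t] are the exposures, phi is the dispersion."""
    n, T1 = len(Y), len(Y[0])
    return sum((v[i][t] / phi) * poisson_unit_deviance(Y[i][t], mu[i][t])
               for i in range(n) for t in range(T1)) / (n * T1)

# Toy data: n = 2 cases observed at T + 1 = 3 time points.
Y  = [[0, 1, 0], [2, 0, 1]]
mu = [[0.1, 0.8, 0.2], [1.5, 0.3, 0.9]]
v  = [[1.0] * 3, [1.0] * 3]
loss = time_distributed_deviance(Y, mu, v)
print(round(loss, 4))
```

The loss vanishes exactly when every prediction matches its observation, and every time point contributes a gradient signal, which is the point of the time-distributed output.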

## **8.3 Special Recurrent Neural Networks**

In the plain-vanilla RN networks introduced above we have defined the memory processes $(\mathbf{z}\_t^{[m]})\_{t \ge 0}$, $1 \le m \le d$, which encode the information history $(\mathbf{x}\_t)\_{t \ge 0}$ through different RN layers in a temporally causal way. This is naturally done through the use of a time-series structure as illustrated, e.g., in Fig. 8.5. There are more specific RN network architectures that allow the memory processes to be of a long memory or a short memory type. In this section, we present the two most popular architectures that pay special attention to the memory storage: the long short-term memory (LSTM) architecture introduced by Hochreiter–Schmidhuber [188] and the gated recurrent unit (GRU) architecture proposed by Cho et al. [76].

## *8.3.1 Long Short-Term Memory Network*

The LSTM network of Hochreiter–Schmidhuber [188] is the most commonly used RN network architecture. The LSTM network simultaneously uses three different activation functions for different purposes: the sigmoid and hyperbolic tangent activation functions, respectively,

$$\phi\_{\sigma}(x) = \frac{1}{1 + e^{-x}} \in (0, 1) \qquad \text{and} \qquad \phi\_{\tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \in (-1, 1),$$

and a general activation function $\phi: \mathbb{R} \to \mathbb{R}$, see also Table 7.1.

The LSTM network relies on several RN layers that are of the same structure as the plain-vanilla RN layer given in (8.7). We start by defining three different so-called *gates* that all have the RN layer structure (8.7). These three gates are used to model the memory cell of the LSTM network. Choose a layer index $m \ge 1$ and assume that $\mathbf{z}\_t^{[m-1]}$ is modeled by the previous layer $m - 1$; for $m = 1$ we initialize $\mathbf{z}\_t^{[0]} = \mathbf{x}\_t$. The three gates are then defined as follows, set $t \ge 1$:

• The *forget gate* models the loss of memory rate

$$\boldsymbol{f}\_{t}^{[m]} = \boldsymbol{f}^{(m)}\left(\mathbf{z}\_{t}^{[m-1]}, \mathbf{z}\_{t-1}^{[m]}\right) = \phi\_{\sigma}^{f}\left(\left\langle W\_{f}^{(m)}, \mathbf{z}\_{t}^{[m-1]}\right\rangle + \left\langle U\_{f}^{(m)}, \mathbf{z}\_{t-1}^{[m]}\right\rangle\right) \in (0,1)^{q\_m},$$

with the network weights $W\_f^{(m)} \in \mathbb{R}^{(q\_{m-1}+1) \times q\_m}$ and $U\_f^{(m)} \in \mathbb{R}^{q\_m \times q\_m}$, and with the sigmoid activation function $\phi\_{\sigma}^{f} = \phi\_{\sigma}$; we also refer to (8.7).


• The *input gate* models the memory update rate

$$\boldsymbol{i}\_{t}^{[m]} = \boldsymbol{i}^{(m)}\left(\mathbf{z}\_{t}^{[m-1]}, \mathbf{z}\_{t-1}^{[m]}\right) = \phi\_{\sigma}^{i}\left(\left\langle W\_{i}^{(m)}, \mathbf{z}\_{t}^{[m-1]}\right\rangle + \left\langle U\_{i}^{(m)}, \mathbf{z}\_{t-1}^{[m]}\right\rangle\right) \in (0,1)^{q\_m},$$

with the network weights $W\_i^{(m)} \in \mathbb{R}^{(q\_{m-1}+1) \times q\_m}$ and $U\_i^{(m)} \in \mathbb{R}^{q\_m \times q\_m}$, and with the sigmoid activation function $\phi\_{\sigma}^{i} = \phi\_{\sigma}$.

• The *output gate* models the release of memory information rate

$$\boldsymbol{o}\_{t}^{[m]} = \boldsymbol{o}^{(m)}\left(\mathbf{z}\_{t}^{[m-1]}, \mathbf{z}\_{t-1}^{[m]}\right) = \phi\_{\sigma}^{o}\left(\left\langle W\_{o}^{(m)}, \mathbf{z}\_{t}^{[m-1]}\right\rangle + \left\langle U\_{o}^{(m)}, \mathbf{z}\_{t-1}^{[m]}\right\rangle\right) \in (0,1)^{q\_m}, \tag{8.12}$$

with the network weights $W_o^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U_o^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and with the sigmoid activation function $\phi_\sigma^o = \phi_\sigma$.

These gates have outputs in $(0,1)$, and they determine the relative amount of memory that is updated and released in each step. The so-called *cell state process* $(c_t^{[m]})_t$ is used to store the relevant memory. Given $z_t^{[m-1]}$, $z_{t-1}^{[m]}$ and $c_{t-1}^{[m]}$, the updated cell state is defined by

$$c_t^{[m]} = c^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}, c_{t-1}^{[m]}\right) \tag{8.13}$$

$$= f_t^{[m]} \odot c_{t-1}^{[m]} + i_t^{[m]} \odot \phi_{\tanh}\left(\left\langle W_c^{(m)}, z_t^{[m-1]}\right\rangle + \left\langle U_c^{(m)}, z_{t-1}^{[m]}\right\rangle\right) \; \in \; \mathbb{R}^{q_m},$$

with the network weights $W_c^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U_c^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and $\odot$ denotes the Hadamard product. This defines how the memory (cell state) is updated and passed forward using the forget and the input gates $f_t^{[m]}$ and $i_t^{[m]}$, respectively.

The neuron activations $z_t^{[m]}$ are updated, given $z_t^{[m-1]}$, $z_{t-1}^{[m]}$ and $c_t^{[m]}$, by

$$z_t^{[m]} = z^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}, c_t^{[m]}\right) = o_t^{[m]} \odot \phi\left(c_t^{[m]}\right) \; \in \; \mathbb{R}^{q_m}, \tag{8.14}$$

with the cell state $c_t^{[m]}$ given in (8.13) and the output gate $o_t^{[m]}$ defined in (8.12). Figure 8.6<sup>2</sup> shows an LSTM cell (8.13)–(8.14), which includes four RN layers (8.7): for the forget gate $f^{(m)}$, the input gate $i^{(m)}$, the output gate $o^{(m)}$ and the cell state update (8.13). These RN layers are combined using the Hadamard product $\odot$, resulting in the updated cell state $c_t^{[m]}$ and the learned representation $z_t^{[m]}$, both being functions of the inputs $x_{0:t}$.

<sup>2</sup> This figure is based on colah's blog explaining LSTMs: https://colah.github.io/posts/2015-08-Understanding-LSTMs/.

**Fig. 8.6** LSTM cell $z^{(m)}$ with forget gate $\phi_\sigma^f$, input gate $\phi_\sigma^i$ and output gate $\phi_\sigma^o$

Below, we summarize the LSTM cell update (8.13)–(8.14) as follows:

$$\left(z_t^{[m-1]}, z_{t-1}^{[m]}, c_{t-1}^{[m]}\right) \; \mapsto \; \left(z_t^{[m]}, c_t^{[m]}\right) = z^{\mathrm{LSTM}(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}, c_{t-1}^{[m]}\right). \tag{8.15}$$

The update (8.15) involves the eight network weight matrices $W_f^{(m)}, W_i^{(m)}, W_o^{(m)}, W_c^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U_f^{(m)}, U_i^{(m)}, U_o^{(m)}, U_c^{(m)} \in \mathbb{R}^{q_m\times q_m}$. Altogether we have $4(q_{m-1}+1+q_m)q_m$ network parameters in each LSTM cell $1 \le m \le d$. These are learned with the gradient descent method. Moreover, we need to initialize the LSTM cell update (8.15). From the previous layer $m-1$ we have the input $z_t^{[m-1]}$, which we initialize as $z_t^{[0]} = x_t$ for $m = 1$ and $t \ge 0$. The initial states $z_0^{[m]}$ and $c_0^{[m]}$ are usually set to zero.
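To make the gate mechanics concrete, the following is a minimal numpy sketch of one LSTM cell update (8.13)–(8.15), not the keras implementation used later in the listings. The dimensions $q_{m-1} = 1$, $q_m = 15$ and the random weights are illustrative assumptions, and $\langle W, \cdot\rangle$ is read as an affine map whose intercept is absorbed in the extra row of $W$ (hence the $q_{m-1}+1$ rows).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

q_prev, q = 1, 15  # illustrative dimensions q_{m-1} and q_m

def affine(W, z):
    # <W, z> in the text: the intercept is absorbed in the extra row of W
    return np.concatenate(([1.0], z)) @ W

def lstm_cell(z_in, z_prev, c_prev, params):
    Wf, Uf, Wi, Ui, Wo, Uo, Wc, Uc = params
    f = sigmoid(affine(Wf, z_in) + z_prev @ Uf)                   # forget gate
    i = sigmoid(affine(Wi, z_in) + z_prev @ Ui)                   # input gate
    o = sigmoid(affine(Wo, z_in) + z_prev @ Uo)                   # output gate, (8.12)
    c = f * c_prev + i * np.tanh(affine(Wc, z_in) + z_prev @ Uc)  # cell state, (8.13)
    z = o * np.tanh(c)                                            # activation, (8.14)
    return z, c                                                   # summary map, (8.15)

# eight weight matrices W_f, U_f, W_i, U_i, W_o, U_o, W_c, U_c
params = [rng.normal(size=s) for s in 4 * [(q_prev + 1, q), (q, q)]]

# one update with zero initial states z_0 and c_0
z1, c1 = lstm_cell(rng.normal(size=q_prev), np.zeros(q), np.zeros(q), params)
```

The total size of the eight weight matrices reproduces the parameter count $4(q_{m-1}+1+q_m)q_m$ stated above.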

## *8.3.2 Gated Recurrent Unit Network*

The LSTM architecture of the previous section seems quite complex and involves many parameters. Cho et al. [76] have introduced the GRU architecture, which is simpler and uses fewer parameters, but has similar properties. The GRU architecture uses two gates that are defined as follows for $t \ge 1$, see also (8.7):


• The *reset gate* models the memory reset rate

$$r_t^{[m]} = r^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) = \phi_\sigma^r\left(\left\langle W_r^{(m)}, z_t^{[m-1]}\right\rangle + \left\langle U_r^{(m)}, z_{t-1}^{[m]}\right\rangle\right) \in (0,1)^{q_m},$$

with the network weights $W_r^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U_r^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and with the sigmoid activation function $\phi_\sigma^r = \phi_\sigma$.

• The *update gate* models the memory update rate

$$u_t^{[m]} = u^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) = \phi_\sigma^u\left(\left\langle W_u^{(m)}, z_t^{[m-1]}\right\rangle + \left\langle U_u^{(m)}, z_{t-1}^{[m]}\right\rangle\right) \in (0,1)^{q_m},$$

with the network weights $W_u^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U_u^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and with the sigmoid activation function $\phi_\sigma^u = \phi_\sigma$.

The neuron activations $z_t^{[m]}$ are updated, given $z_t^{[m-1]}$ and $z_{t-1}^{[m]}$, by

$$z_t^{[m]} = z^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) \tag{8.16}$$

$$= r_t^{[m]} \odot z_{t-1}^{[m]} + \left(1 - r_t^{[m]}\right) \odot \phi\left(\left\langle W^{(m)}, z_t^{[m-1]}\right\rangle + u_t^{[m]} \odot \left\langle U^{(m)}, z_{t-1}^{[m]}\right\rangle\right) \; \in \; \mathbb{R}^{q_m},$$

with the network weights $W^{(m)} \in \mathbb{R}^{(q_{m-1}+1)\times q_m}$ and $U^{(m)} \in \mathbb{R}^{q_m\times q_m}$, and for a general activation function $\phi$.

The GRU and the LSTM architectures are similar, the former using fewer parameters because we do not explicitly model the cell state process. For an illustration of a GRU cell we refer to Fig. 8.7. In the sequel we focus on the LSTM architecture;

**Fig. 8.7** GRU cell $z^{(m)}$ with reset gate $\phi_\sigma^r$ and update gate $\phi_\sigma^u$

though the GRU architecture is simpler and has fewer parameters, it is less robust in fitting.
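For comparison, here is a minimal numpy sketch of the GRU update (8.16) under the same illustrative assumptions as before (dimensions and random weights are hypothetical). It only needs the six weight matrices $W_r, U_r, W_u, U_u, W, U$, i.e., $3(q_{m-1}+1+q_m)q_m$ parameters instead of the LSTM's $4(q_{m-1}+1+q_m)q_m$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

q_prev, q = 1, 15  # illustrative dimensions q_{m-1} and q_m

def affine(W, z):
    return np.concatenate(([1.0], z)) @ W  # intercept absorbed in W's extra row

def gru_cell(z_in, z_prev, params):
    Wr, Ur, Wu, Uu, W, U = params
    r = sigmoid(affine(Wr, z_in) + z_prev @ Ur)   # reset gate
    u = sigmoid(affine(Wu, z_in) + z_prev @ Uu)   # update gate
    # (8.16): convex combination of the old state and a gated candidate
    return r * z_prev + (1 - r) * np.tanh(affine(W, z_in) + u * (z_prev @ U))

# six weight matrices W_r, U_r, W_u, U_u, W, U
params = [rng.normal(size=s) for s in 3 * [(q_prev + 1, q), (q, q)]]
z1 = gru_cell(rng.normal(size=q_prev), np.zeros(q), params)
```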

## **8.4 Lab: Mortality Forecasting with RN Networks**

## *8.4.1 Lee–Carter Model, Revisited*

Mortality data has a natural time-series structure, and for this reason mortality forecasting is an obvious problem to be studied with RN networks. For instance, the LC mortality model (7.63) involves a stochastic process $(k_t)_t$ that needs to be extrapolated into the future. This extrapolation can be done in different ways. The original proposal of Lee and Carter [238] was to analyze ARIMA time-series models; using standard statistical tools, Lee and Carter found that the random walk with drift gives a good stochastic description of the time index process $(k_t)_t$. Nigri et al. [286] proposed to fit an LSTM network to this stochastic process; this approach is also studied in Lindholm–Palmborg [252], where an efficient use of the mortality data for network fitting is discussed. These approaches still rely on the classical LC calibration using the SVD of Sect. 7.5.4, and the LSTM network is (only) used to extrapolate the LC time index process $(k_t)_t$.

More generally, one can design an RN network architecture that directly processes the raw mortality rates $M_{x,t} = D_{x,t}/e_{x,t}$, not specifically relying on the LC structure. This has been done in Richman–Wüthrich [316] using a FN network architecture, in Perla et al. [301] using an RN network and a convolutional neural (CN) network architecture, and in Schürch–Korn [330] extending this analysis to the study of prediction uncertainty using bootstrapping. A similar CN network approach has been taken by Wang et al. [375], interpreting the raw mortality data of Fig. 7.32 as an image.

#### **Lee–Carter Mortality Model: Random Walk with Drift Extrapolation**

We revisit the LC mortality model [238] presented in Sect. 7.5.4. The LC log-mortality rate is assumed to have the following structure, see (7.63),

$$
\log(\mu_{x,t}^{(p)}) = a_x^{(p)} + b_x^{(p)} k_t^{(p)},
$$

for the ages $x_0 \le x \le x_1$ and for the calendar years $t \in \mathcal{T}$. We now add the upper indices $(p)$ to consider different populations $p$. The SVD gives us the estimates $\widehat{a}_x^{(p)}$, $\widehat{k}_t^{(p)}$ and $\widehat{b}_x^{(p)}$ based on the observed centered raw log-mortality rates, see Sect. 7.5.4. The SVD is applied to each population $p$ separately, i.e., there is no interaction between the different populations. This approach allows us to fit a separate log-mortality surface estimate $(\log(\widehat{\mu}_{x,t}^{(p)}))_{x_0\le x\le x_1;\, t\in\mathcal{T}}$ to each population $p$. Figure 7.33 shows an example for two populations $p$, namely, Swiss females and Swiss males.

Mortality forecasting requires extrapolating the time index processes $(\widehat{k}_t^{(p)})_{t\in\mathcal{T}}$ beyond the latest observed calendar year $t_1 = \max\{\mathcal{T}\}$. As mentioned in Lee–Carter [238], a random walk with drift provides a suitable model for $(\widehat{k}_t^{(p)})_{t\ge 0}$ for many populations $p$, see Fig. 7.35 for the Swiss population. Assume that

$$
\widehat{k}\_{t+1}^{(p)} = \widehat{k}\_t^{(p)} + \varepsilon\_{t+1}^{(p)} \qquad t \ge 0,\tag{8.17}
$$

with $\varepsilon_t^{(p)} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\delta_p, \sigma_p^2)$, $t \ge 1$, having drift $\delta_p \in \mathbb{R}$ and variance $\sigma_p^2 > 0$.

Model assumption (8.17) allows us to estimate the (constant) drift $\delta_p$ with MLE. For the observations $(\widehat{k}_t^{(p)})_{t\in\mathcal{T}}$ we receive the log-likelihood function

$$\delta_p \mapsto \ell_{(\widehat{k}_t^{(p)})_{t\in\mathcal{T}}}(\delta_p) = \sum_{t=t_0+1}^{t_1} -\log(\sqrt{2\pi}\,\sigma_p) - \frac{1}{2\sigma_p^2}\left(\widehat{k}_t^{(p)} - \widehat{k}_{t-1}^{(p)} - \delta_p\right)^2,$$

with the first observed calendar year $t_0 = \min\{\mathcal{T}\}$. The MLE is given by

$$
\widehat{\delta}_p^{\mathrm{MLE}} = \frac{\widehat{k}_{t_1}^{(p)} - \widehat{k}_{t_0}^{(p)}}{t_1 - t_0}. \tag{8.18}
$$

This allows us to forecast the time index process for $t > t_1$ by

$$
\widehat{k}\_{t}^{(p)} = \widehat{k}\_{t\_1}^{(p)} + (t - t\_1)\widehat{\delta}\_p^{\mathsf{MLE}}.
$$
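The MLE (8.18) and the subsequent extrapolation can be checked on a toy example; the values of $\widehat{k}_t$ below are hypothetical. Note that the MLE is simply the mean increment, because the sum of increments telescopes.

```python
import numpy as np

# hypothetical estimated time index k_t for calendar years 1950-1955
k_hat = np.array([0.0, -1.2, -1.9, -3.1, -4.4, -5.0])
t0, t1 = 1950, 1955

# MLE of the drift, (8.18): the mean increment telescopes to this ratio
delta_mle = (k_hat[-1] - k_hat[0]) / (t1 - t0)
assert np.isclose(delta_mle, np.diff(k_hat).mean())

# extrapolation beyond t1
def forecast(t):
    return k_hat[-1] + (t - t1) * delta_mle
```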

We explore this extrapolation for different Western European countries from the HMD [195]. We consider females and males separately for the countries {AUT, BE, CH, ESP, FRA, ITA, NL, POR}; thus, we choose $2 \cdot 8 = 16$ different populations $p$. For these countries we have observations for the ages $0 = x_0 \le x \le x_1 = 99$ and for the calendar years $1950 \le t \le 2018$.<sup>3</sup> For the following analysis we choose $\mathcal{T} = \{t_0 \le t \le t_1\} = \{1950 \le t \le 2003\}$; thus, we fit the models on 54 years of mortality history. These fitted models are then extrapolated to the calendar years $2004 \le t \le 2018$. These 15 calendar years from 2004 to 2018 allow us to perform an out-of-sample evaluation because we have the observations $M_{x,t}^{(p)} = D_{x,t}^{(p)}/e_{x,t}^{(p)}$ for these years from the HMD [195].

Figure 8.8 shows the estimated time index process $(\widehat{k}_t^{(p)})_{t\in\mathcal{T}}$ to the left of the dotted lines, and to the right of the dotted lines we have the random walk with drift extrapolation $(\widehat{k}_t^{(p)})_{t>t_1}$. The general observation is that, indeed, the random walk with drift seems to be a suitable model for $(\widehat{k}_t^{(p)})_t$. Moreover, there is a huge

<sup>3</sup> We exclude Germany from this consideration of (continental) Western European countries because the German mortality history is shorter due to the reunification in 1990.

**Fig. 8.8** Random walk with drift extrapolation of the time index process $(\widehat{k}_t)_t$ for different countries and genders; the $y$-scale is the same in both plots

similarity between the different countries, with only the Netherlands (NL) being somewhat of an outlier.

#### *Remarks 8.4*


• The LC model is fitted to each population $p$ separately, without exploring any common structure across the populations. There are many multi-population extensions that try to learn common structure across different populations. We mention the common age effect (CAE) model of Kleinow [218], the augmented common factor (ACF) model of Li–Lee [249] and the functional time-series models of Hyndman et al. [196] and Shang–Haberman [334]. A direct multi-population extension of the SVD matrix decomposition of the LC model is obtained by the tensor decomposition approaches of Russolillo et al. [325] and Dong et al. [110].

#### **Lee–Carter Mortality Model: LSTM Extrapolation**

Our aim here is to replace the individual random walk with drift extrapolations (8.17) by a common extrapolation across all considered populations $p$. For this we design an LSTM architecture. A second observation is that the increments $\varepsilon_t^{(p)} = \widehat{k}_t^{(p)} - \widehat{k}_{t-1}^{(p)}$ have an average empirical auto-correlation (at lag 1) of $-0.33$. This clearly questions the Gaussian i.i.d. assumption in (8.17).

We first discuss the available data and construct the input data. We have the time-series observations $(\widehat{k}_t^{(p)})_{t\in\mathcal{T}}$, and the population index $p = (c, g)$ has two categorical labels, $c$ for country and $g$ for gender. We are going to use two-dimensional embedding layers for these two categorical variables, see (7.31) for embedding layers. The time-series observations $(\widehat{k}_t^{(p)})_{t\in\mathcal{T}}$ will be pre-processed such that we do not feed the entire time-series into the LSTM layer simultaneously, but divide it into shorter time-series. We will directly forecast the increments $\varepsilon_t^{(p)} = \widehat{k}_t^{(p)} - \widehat{k}_{t-1}^{(p)}$ and not the time index process $(\widehat{k}_t^{(p)})_{t\ge t_0}$; in extrapolations with drift it is easier to forecast the increments with the networks. We choose a *lookback period* of $\tau = 3$ calendar years, and we aim at predicting the response $Y_t = \varepsilon_t^{(p)}$ based on the time-series features $x_{t-\tau:t-1} = (\varepsilon_{t-\tau}^{(p)}, \dots, \varepsilon_{t-1}^{(p)})^\top \in \mathbb{R}^\tau$. This provides us with the following data structure for each population $p = (c, g)$:


$$\left\{\left(t,\, c,\, g,\, x_{t-\tau:t-1};\; Y_t = \varepsilon_t^{(p)}\right) \,:\; t_0 + \tau + 1 \le t \le t_1\right\}. \tag{8.19}$$

Thus, each observation $Y_t = \varepsilon_t^{(p)}$ is equipped with the feature information $(t, c, g, x_{t-\tau:t-1})$. As discussed in Lindholm–Palmborg [252], one should highlight that there is a dependence across $t$, since we have a diagonal cohort structure in the features and the observations $(x_{t-\tau:t-1}, Y_t)$. Usually, this dependence is not harmful in stochastic gradient descent fitting.
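The construction of the rows of (8.19) from one estimated time index series can be sketched as follows; the $\widehat{k}_t$ values are hypothetical, and the labels $c$ and $g$ are attached separately.

```python
import numpy as np

tau = 3  # lookback period
# hypothetical estimated time index k_t
k_hat = np.array([0.0, -1.2, -1.9, -3.1, -4.4, -5.0, -6.2])
eps = np.diff(k_hat)  # increments eps_t = k_t - k_{t-1}

# one row per response: features x_{t-tau:t-1}, response Y_t = eps_t
X = np.stack([eps[s:s + tau] for s in range(len(eps) - tau)])
Y = eps[tau:]
```

With $t_1 - t_0 = 53$ calendar years of history, this construction yields $t_1 - t_0 - \tau = 50$ rows per population, as used below.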

**Listing 8.1** LSTM architecture example

```
1 TS = layer_input(shape=c(lookback,1), dtype='float32', name='TS')
2 Country = layer_input(shape=c(1), dtype='int32', name='Country')
3 Gender = layer_input(shape=c(1), dtype='int32', name='Gender')
4 Time = layer_input(shape=c(1), dtype='float32', name='Time')
5 #
6 CountryEmb = Country %>%
7 layer_embedding(input_dim=8,output_dim=2,input_length=1,name='CountryEmb') %>%
8 layer_flatten(name='Country_flat')
9 #
10 GenderEmb = Gender %>%
11 layer_embedding(input_dim=2,output_dim=2,input_length=1,name='GenderEmb') %>%
12 layer_flatten(name='Gender_flat')
13 #
14 LSTM = TS %>%
15 layer_lstm(units=15,activation='tanh',recurrent_activation='sigmoid',
16 name='LSTM')
17 #
18 Output = list(LSTM,CountryEmb,GenderEmb,Time) %>% layer_concatenate() %>%
19 layer_dense(units=10, activation='tanh', name='FNLayer') %>%
20 layer_dense(units=1, activation='linear', name='Network')
21 #
22 model = keras_model(inputs = list(TS, Country, Gender, Time),
23 outputs = c(Output))
```
In Fig. 8.9 we plot the LSTM architecture used to forecast $\varepsilon_t^{(p)}$ for $t > t_1$, and Listing 8.1 gives the corresponding R code. We process the time-series $x_{t-\tau:t-1}$ through an LSTM cell, see lines 14–16 of Listing 8.1. We choose a shallow LSTM network ($d = 1$) and therefore drop the upper index $m = 1$ in (8.15), but we add an upper index [LSTM] to highlight the output of the LSTM cell. This gives us the

**Fig. 8.9** LSTM architecture used to forecast $\varepsilon_t^{(p)}$ for $t > t_1$

LSTM cell updates for *t* − *τ* ≤ *s* ≤ *t* − 1

$$\left(\mathbf{x}\_s, \mathbf{z}\_{s-1}^{\text{[LSTM]}}, \mathbf{c}\_{s-1}\right) \; \mapsto \; \left(\mathbf{z}\_s^{\text{[LSTM]}}, \mathbf{c}\_s\right) = \mathbf{z}^{\text{LSTM}}\left(\mathbf{x}\_s, \mathbf{z}\_{s-1}^{\text{[LSTM]}}, \mathbf{c}\_{s-1}\right).$$

This LSTM recursion to process the time-series $x_{t-\tau:t-1}$ gives us the LSTM output $z_{t-1}^{[\mathrm{LSTM}]} \in \mathbb{R}^{q_1}$, and it involves $4(q_0+1+q_1)q_1 = 4(2+15)\cdot 15 = 1\,020$ network parameters for the input dimension $q_0 = 1$ and the output dimension $q_1 = 15$.

For the categorical country code $c$ and the binary gender $g$ we choose two-dimensional embedding layers, see (7.31),

$$c \mapsto e^{\mathrm{C}}(c) \in \mathbb{R}^2 \qquad \text{and} \qquad g \mapsto e^{\mathrm{G}}(g) \in \mathbb{R}^2,$$

these embedding maps give us $2(8+2) = 20$ embedding weights. Finally, we concatenate the LSTM output $z_{t-1}^{[\mathrm{LSTM}]} \in \mathbb{R}^{15}$, the embeddings $e^{\mathrm{C}}(c), e^{\mathrm{G}}(g) \in \mathbb{R}^2$ and the continuous calendar year variable $t \in \mathbb{R}$, and process this vector through a shallow FN network with $q_2 = 10$ neurons, see lines 18–20 of Listing 8.1. This FN layer gives us $(q_1+2+2+1+1)q_2 = (15+2+2+1+1)\cdot 10 = 210$ parameters. Together with the output parameter of dimension $q_2+1 = 11$, we receive $1\,020 + 20 + 210 + 11 = 1\,261$ network parameters to be fitted, which seems quite a lot.

To fit this model we have $8 \cdot 2 = 16$ populations, and for each population we have the observations $\widehat{k}_t^{(p)}$ for the calendar years $1950 \le t \le 2003$. Considering the increments $\varepsilon_t^{(p)}$ and a lookback period of $\tau = 3$ calendar years gives us $2003 - 1950 - \tau = 50$ observations, rows in (8.19), per population $p$; thus, we have in total 800 observations. For the gradient descent fitting and the early stopping we choose a training to validation split of 8 : 2. As loss function we choose the squared error loss. This implicitly assumes that the increments $Y_t = \varepsilon_t^{(p)}$ are Gaussian distributed; in other words, minimizing the squared error loss is equivalent to maximizing the Gaussian log-likelihood. We then fit this model to the data using early stopping as described in (7.27). We analyze this fitted model: Fig. 8.10 provides the learned embeddings for the country codes $c$, and these learned embeddings have some similarity with the European map.
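The equivalence between the squared error loss and the Gaussian log-likelihood is immediate: for any fixed variance $\sigma^2 > 0$ and network mean $\mu_\theta(\cdot)$,

$$\operatorname*{arg\,max}_{\theta} \; \sum_t \left( -\log(\sqrt{2\pi}\,\sigma) - \frac{\left(Y_t - \mu_\theta(x_t)\right)^2}{2\sigma^2} \right) \;=\; \operatorname*{arg\,min}_{\theta} \; \sum_t \left( Y_t - \mu_\theta(x_t) \right)^2,$$

since the normalizing term does not depend on $\theta$ and the factor $1/(2\sigma^2)$ is a positive constant.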

The final step is the extrapolation $\widehat{k}_t$, $t > t_1$. These updates need to be done recursively. For $t = t_1 + 1$ we initialize the time-series feature

$$x_{t_1+1-\tau:t_1} = (\varepsilon_{t_1+1-\tau}^{(p)}, \dots, \varepsilon_{t_1}^{(p)})^\top \in \mathbb{R}^\tau. \tag{8.20}$$

Using the feature information $(t_1+1, c, g, x_{t_1+1-\tau:t_1})$ allows us to forecast the next increment $Y_{t_1+1} = \varepsilon_{t_1+1}^{(p)}$ by $\widehat{Y}_{t_1+1}$, using the fitted LSTM architecture of Fig. 8.9. Thus, this LSTM network allows us to perform a *one-period-ahead forecast* to receive

$$
\widehat{k}_{t_1+1} = \widehat{k}_{t_1} + \widehat{Y}_{t_1+1}. \tag{8.21}
$$

This update (8.21) needs to be iterated recursively. For the next period $t = t_1 + 2$ we set the time-series feature

$$x_{t_1+2-\tau:t_1+1} = (\varepsilon_{t_1+2-\tau}^{(p)}, \dots, \varepsilon_{t_1}^{(p)}, \widehat{Y}_{t_1+1})^\top \in \mathbb{R}^\tau, \tag{8.22}$$

which gives us the next predictions $\widehat{Y}_{t_1+2}$ and $\widehat{k}_{t_1+2}$, etc.
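The recursive one-period-ahead updates (8.20)–(8.22) can be sketched as follows; the fitted LSTM is replaced by a hypothetical stand-in predictor, and all numerical values are illustrative only.

```python
import numpy as np

tau = 3
eps_hist = [-1.2, -0.7, -1.2]  # last tau observed increments (hypothetical)
k_last = -5.0                  # hypothetical \hat{k}_{t_1}

def predict_next(x):
    # stand-in for the fitted LSTM network; here simply the window mean
    return float(np.mean(x))

window, k_path = list(eps_hist), [k_last]
for _ in range(15):                    # forecast 15 calendar years ahead
    y = predict_next(window[-tau:])    # one-period-ahead forecast, (8.21)
    window.append(y)                   # (8.22): predictor replaces observation
    k_path.append(k_path[-1] + y)
```

Note that the loop feeds predictions back in as if they were observations; this is exactly the consistency issue discussed below.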

In Fig. 8.11 we present the extrapolation of $(\varepsilon_t^{(p)})_t$ for Belgian females and males. The blue curve shows the observed increments $(\varepsilon_t^{(p)})_{1951\le t\le 2003}$, and the LSTM fitted (in-sample) values $(\widehat{Y}_t)_{1954\le t\le 2003}$ are in red color. Firstly, we observe a negative correlation (zig-zag behavior) both in the blue observations $(\varepsilon_t^{(p)})_{1951\le t\le 2003}$ and in their red estimated means $(\widehat{Y}_t)_{1954\le t\le 2003}$. Thus, the LSTM finds this negative correlation (and it does not propose i.i.d. residuals). Secondly, the volatility in the

**Fig. 8.11** LSTM network extrapolation $(\widehat{Y}_t)_{t>t_1}$ for Belgian (BE) females and males

red curve is smaller than in the blue curve; the former relates to expected values and the latter to observations of the random variables (which should be more volatile). The light-blue color shows the random walk with drift extrapolation (which is just a horizontal straight line at level $\widehat{\delta}_p^{\mathrm{MLE}}$, see (8.18)). The orange color shows the LSTM extrapolation using the recursive one-period-ahead updates (8.20)–(8.22), which has a zig-zag behavior that vanishes over time. This vanishing behavior is critical and is discussed next.

There is one issue with this recursive one-period-ahead updating algorithm: it is not fully consistent in how the data is being used. The original LSTM architecture calibration is based on the feature components $\varepsilon_t^{(p)}$, see (8.20). Since these increments are not known for the later periods $t > t_1$, we replace their unknown values by the predictors, see (8.22). The subtle point here is that the predictors live on the level of expected values, not on the level of random variables. Thus, $\widehat{Y}_t$ is typically less volatile than $\varepsilon_t^{(p)}$, but in (8.22) we pretend that we can use these predictors as a one-to-one replacement. A more consistent way would be to simulate/bootstrap $\varepsilon_t^{(p)}$ from $\mathcal{N}(\widehat{Y}_t, \widehat{\sigma}^2)$ so that the extrapolation receives the same volatility as the original process. For simplicity we refrain from doing so, but Fig. 8.11 indicates that this would be a necessary step: the volatility in the orange curve vanishes after the calendar year 2003, i.e., the zig-zag behavior disappears, which is clearly not appropriate.

The LSTM extrapolation of $(\widehat{k}_t)_t$ is shown in Fig. 8.12. We observe quite some similarity to the random walk with drift extrapolation in Fig. 8.8, and, indeed, the random walk with drift seems to work very well (though the auto-correlation has not been specified correctly). Note that Fig. 8.8 is based on the individual extrapolations in $p$, whereas in Fig. 8.12 we have a common model for all populations.

Table 8.1 shows how often one model outperforms the other one (out-of-sample on calendar years 2004 ≤ *t* ≤ 2018 and per gender). On the male populations of

**Fig. 8.12** LSTM network extrapolation of $(\widehat{k}_t)_t$ for different countries and genders

**Table 8.1** Comparison of the out-of-sample mean squared error losses for the calendar years 2004 ≤ *t* ≤ 2018: the numbers show how often one approach outperforms the other one on each gender


the 8 European countries each model outperforms the other one 4 times, whereas for the female populations the random walk with drift gives the better out-of-sample prediction 5 times. Of course, this seems disappointing for the LSTM approach. This observation is quite common: the deep learning approach outperforms the classical methods on complex problems, but on simple problems, such as the one here, we should go for a classical (simpler) model like a random walk with drift or an ARIMA model.

## *8.4.2 Direct LSTM Mortality Forecasting*

The previous section relied on the LC mortality model, and only the extrapolation of the time-series $(\widehat{k}_t)_t$ was based on an RN network architecture. In this section we aim at directly processing the raw mortality rates $M_{x,t} = D_{x,t}/e_{x,t}$ through a network; thus, we perform the representation learning directly on the raw data. We therefore use a simplified version of the network architecture proposed in Perla et al. [301].

As input to the network we use the raw mortality rates $M_{x,t}$. We choose a lookback period of $\tau = 5$ years and define the time-series feature information used to forecast the mortality in calendar year $t$ by

$$x_{t-\tau:t-1} = (x_{t-\tau}, \dots, x_{t-1}) = \left(M_{x,s}\right)_{x_0 \le x \le x_1;\, t-\tau \le s \le t-1} \in \mathbb{R}^{(x_1-x_0+1)\times\tau} = \mathbb{R}^{100\times 5}. \tag{8.23}$$

Thus, we directly process the raw mortality rates (simultaneously for all ages $x$) through the network architecture; in the corresponding R code we need to input the transposed features $x_{t-\tau:t-1}^\top \in \mathbb{R}^{5\times 100}$, see line 1 of Listing 8.2.

We choose a shallow LSTM network (*d* = 1) and drop the upper index *m* = 1 in (8.15). This gives us the LSTM cell updates for *t* − *τ* ≤ *s* ≤ *t* − 1

$$\left(x_s, z_{s-1}^{[\mathrm{LSTM}]}, c_{s-1}\right) \; \mapsto \; \left(z_s^{[\mathrm{LSTM}]}, c_s\right) = z^{\mathrm{LSTM}}\left(x_s, z_{s-1}^{[\mathrm{LSTM}]}, c_{s-1}\right).$$

This LSTM recursion to process the time-series $x_{t-\tau:t-1}$ gives us the LSTM output $z_{t-1}^{[\mathrm{LSTM}]} \in \mathbb{R}^{q_1}$, see lines 14–15 of Listing 8.2. It involves $4(q_0+1+q_1)q_1 = 4(100+1+20)\cdot 20 = 9\,680$ network parameters for the input dimension $q_0 = 100$

**Fig. 8.13** LSTM architecture used to process the raw mortality rates *(Mx,t)x,t*

and the output dimension $q_1 = 20$. Many statisticians would probably stop at this point with this approach, as it seems highly over-parametrized. Let's see what we get.

For the categorical country code $c$ and the binary gender $g$ we choose two one-dimensional embeddings, see (7.31),

$$c \mapsto e^{\mathrm{C}}(c) \in \mathbb{R} \qquad \text{and} \qquad g \mapsto e^{\mathrm{G}}(g) \in \mathbb{R}. \tag{8.24}$$

These embeddings give us 8 + 2 = 10 embedding weights. Figure 8.13 shows the LSTM cell in orange color and the embeddings in yellow color (in the colored version).

The LSTM output and the two embeddings are then concatenated to a learned representation $z_{t-1} = (z_{t-1}^{[\mathrm{LSTM}]}, e^{\mathrm{C}}(c), e^{\mathrm{G}}(g))^\top \in \mathbb{R}^{q_1+1+1} = \mathbb{R}^{22}$. The 22-dimensional learned representation $z_{t-1}$ *encodes* the 500-dimensional input $x_{t-\tau:t-1} \in \mathbb{R}^{100\times 5}$ and the two categorical variables $c$ and $g$. The last step is to *decode* this representation $z_{t-1} \in \mathbb{R}^{22}$ to predict the log-mortality rates $(Y_{x,t})_{0\le x\le 99} = (\log M_{x,t})_{0\le x\le 99} \in \mathbb{R}^{100}$, simultaneously for all ages $x$. This decoding is obtained by the code on lines 17–19 of Listing 8.2; this reads as

$$\mathbf{z}\_{t-1} \mapsto \left(\boldsymbol{\beta}\_{\boldsymbol{x}}^{0} + \boldsymbol{\beta}\_{\boldsymbol{x}}^{\mathrm{C}} \mathbf{e}^{\mathrm{C}}(\boldsymbol{c}) + \boldsymbol{\beta}\_{\boldsymbol{x}}^{\mathrm{G}} \mathbf{e}^{\mathrm{G}}(\boldsymbol{g}) + \left\langle \boldsymbol{\beta}\_{\boldsymbol{x}}, \boldsymbol{z}\_{t-1}^{\mathrm{[LSTM]}} \right\rangle \right)\_{0 \le \boldsymbol{x} \le 99}.\tag{8.25}$$

This decoding involves another $(1+22)\cdot 100 = 2\,300$ parameters $(\beta_x^0, \beta_x^{\mathrm{C}}, \beta_x^{\mathrm{G}}, \beta_x)_{0\le x\le 99}$. Thus, altogether this LSTM network has $r = 9\,680 + 10 + 2\,300 = 11\,990$ parameters.
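A numpy sketch of the decoder (8.25), with hypothetical values for the learned representation and the decoder parameters; it also verifies the parameter count $(1+22)\cdot 100 = 2\,300$.

```python
import numpy as np

rng = np.random.default_rng(2)
q1, n_ages = 20, 100

# hypothetical learned quantities
z_lstm = rng.normal(size=q1)   # LSTM output z_{t-1}^{[LSTM]}
e_c, e_g = 0.3, -0.1           # one-dimensional country/gender embeddings

# hypothetical decoder parameters, (1 + 22) * 100 = 2300 in total
beta0 = rng.normal(size=n_ages)        # intercepts beta_x^0
betaC = rng.normal(size=n_ages)        # country loadings beta_x^C
betaG = rng.normal(size=n_ages)        # gender loadings beta_x^G
B = rng.normal(size=(n_ages, q1))      # scalar-product weights beta_x

# (8.25): forecast log-mortality rates for all 100 ages simultaneously
log_mu = beta0 + betaC * e_c + betaG * e_g + B @ z_lstm
```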

Summarizing: the above architecture follows the philosophy of the auto-encoder of Sect. 7.5. A high-dimensional observation $(x_{t-\tau:t-1}, c, g)$ is encoded to a low-dimensional bottleneck activation $z_{t-1} \in \mathbb{R}^{22}$, which is then decoded by (8.25) to give the forecast $(\widehat{Y}_{x,t})_{0\le x\le 99}$ for the log-mortality rates. It is not precisely an auto-encoder because the response is different from the input, as we forecast the log-mortality rates in the next calendar year $t$ based on the information $z_{t-1}$ that

**Listing 8.2** LSTM architecture to directly process the raw mortality rates *(Mx,t)x,t*

```
1 TS = layer_input(shape=c(lookback,100), dtype='float32', name='TS')
2 Country = layer_input(shape=c(1), dtype='int32', name='Country')
3 Gender = layer_input(shape=c(1), dtype='int32', name='Gender')
4 Time = layer_input(shape=c(1), dtype='float32', name='Time')
5 #
6 CountryEmb = Country %>%
7 layer_embedding(input_dim=8,output_dim=1,input_length=1,name='CountryEmb') %>%
8 layer_flatten(name='Country_flat')
9 #
10 GenderEmb = Gender %>%
11 layer_embedding(input_dim=2,output_dim=1,input_length=1,name='GenderEmb') %>%
12 layer_flatten(name='Gender_flat')
13 #
14 LSTM = TS %>%
15 layer_lstm(units=20,activation='linear',recurrent_activation='sigmoid',
16 name='LSTM')
17 #
18 Output = list(LSTM,CountryEmb,GenderEmb) %>% layer_concatenate() %>%
19 layer_dense(units=100, activation='linear', name='scalarproduct') %>%
20 layer_reshape(c(1,100), name = 'Output')
21 #
22 model = keras_model(inputs = list(TS, Country, Gender),
23 outputs = c(Output))
```
is available at the end of the previous calendar year $t-1$. In contrast to the LC mortality model, we no longer rely on the two-step approach of first fitting the parameters with an SVD and then performing a random walk with drift extrapolation. This encoder-decoder network performs both steps simultaneously.

We fit this network architecture to the available data. We have $r = 11\,990$ network parameters. Based on a lookback period of $\tau = 5$ years, we have $2003 - 1950 - \tau + 1 = 49$ observations per population $p = (c, g)$. Thus, we have in total 784 observations $(x_{t-\tau:t-1}, c, g, (Y_{x,t})_{0\le x\le 99})$. We fit this network using the nadam version of the gradient descent algorithm. We choose a training to validation split of 8 : 2 and we explore 10'000 gradient descent epochs. A crucial observation is that the algorithm converges rather slowly and does not show any signs of over-fitting, i.e., there is no strong need for early stopping. This seems surprising because we have 11'990 parameters and only 784 observations. There are a couple of important ingredients that make this work. The features and observations themselves are high-dimensional, and the low-dimensional encoding (compression) leads to a natural regularization; moreover, this is combined with linear activation functions, see lines 15 and 19 of Listing 8.2. The gradient descent fitting has a certain inertia, and it seems that high-dimensional problems on comparably smooth high-dimensional data do not over-fit to individual components because the gradients are not very sensitive in the individual partial derivatives (in high dimensions). These high-dimensional approaches only work if we have sufficiently many populations across which we can learn; here we have 16 populations, and Perla et al. [301] even use 76 populations.

Since every gradient descent fit still involves several elements of randomness, we consider the nagging predictor (7.44), averaging over 10 fitted networks, see Sect. 7.4.4.

**Table 8.2** Comparison of the out-of-sample mean squared losses for the calendar years $2004 \le t \le 2018$; the figures are in $10^{-4}$

The out-of-sample prediction results on the calendar years 2004 to 2018, i.e., $t \ge t_1 = 2004$, are presented in Table 8.2. These results verify the appropriateness of this LSTM approach. It outperforms the LC model in 6 out of 8 cases on the female populations and in 7 out of 8 cases on the male populations; only for the French populations does this LSTM approach seem to have some difficulties (compared to the LC model). Note that these are out-of-sample figures because the LSTM has only been fitted on the data prior to 2004. Moreover, we did not pre-process the raw mortality rates $M_{x,t}$, $t \le 2003$, and the prediction is done recursively in a one-period-ahead prediction approach; we also refer to (8.22). A more detailed analysis of the results shows that the LC and the LSTM approaches behave rather similarly for females. For males, the LSTM prediction clearly outperforms the LC model prediction, and this out-performance holds across different ages $x$ and different calendar years $t \ge 2004$.

The advantage of this LSTM approach is that we can directly predict by processing the raw data. The disadvantage compared to the LC approach is that the LSTM network approach is more complex and more time-consuming. Moreover, unlike in the LC approach, we cannot (easily) assess the prediction uncertainty. In the LC approach the prediction uncertainty is obtained from assessing the uncertainty in the extrapolation and the uncertainty in the parameter estimates, e.g., using a bootstrap. The LSTM approach is not sufficiently robust (at least not on our data) to provide any reasonable uncertainty estimates.

We close this section and example by analyzing the functional form of the decoder (8.25). We observe that this decoder closely resembles the LC model assumption (7.63)

$$\begin{aligned} \widehat{Y}_{x,t} &= \beta_x^{0} + \boldsymbol{\beta}_x^{C} \mathbf{e}^{C}(c) + \boldsymbol{\beta}_x^{G} \mathbf{e}^{G}(g) + \left\langle \boldsymbol{\beta}_x, \boldsymbol{z}_{t-1}^{[\text{LSTM}]} \right\rangle, \\ \log(\mu_{x,t}^{(p)}) &= a_x^{(p)} + b_x^{(p)} k_t^{(p)}. \end{aligned}$$

The LC model considers the average force of mortality $a_x^{(p)} \in \mathbb{R}$ for each population $p = (c, g)$ and each age $x$; the LSTM architecture has the same term $\beta_x^0 + \boldsymbol{\beta}_x^{C} \mathbf{e}^{C}(c) + \boldsymbol{\beta}_x^{G} \mathbf{e}^{G}(g)$. In the LC model, the change of the force of mortality is considered by a population-dependent term $b_x^{(p)} k_t^{(p)}$, whereas the LSTM architecture has the term $\langle \boldsymbol{\beta}_x, \boldsymbol{z}_{t-1}^{[\text{LSTM}]} \rangle$. This latter term is also population-dependent because the LSTM cell directly processes the raw mortality data $M_{x,t}$ coming from the different populations $p$. Note that this is the only time-$t$-dependent term in the LSTM architecture. We conclude that the main difference between these two forecast approaches is how the past mortality observations are processed. Apart from that, the general structure is the same.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 9 Convolutional Neural Networks**

The previous two chapters considered fully-connected feed-forward neural (FN) networks and recurrent neural (RN) networks. Fully-connected FN networks are the prototype of networks for deep representation learning on tabular data. This type of network extracts *global properties* from the features $\boldsymbol{x}$. RN networks are an adaptation of FN networks to time-series data. Convolutional neural (CN) networks are a third type of network, and their specialty is to extract *local structure* from the features. They were originally introduced for speech and image recognition, aiming at finding similar structure in different parts of the feature $\boldsymbol{x}$. For instance, if $\boldsymbol{x}$ is a picture consisting of pixels, and if we want to classify this picture according to its contents, then we try to find similar structure (objects) in different locations of this picture. CN networks are suitable for this task as they work with filters (kernels) that have a fixed window size. These filters screen across the picture to detect similar local structure at different locations. CN networks were introduced in the 1980s by Fukushima [145] and LeCun et al. [234, 235], and they have celebrated great success in many applications. Our introduction to CN networks is based on the tutorial of Meier–Wüthrich [269]. For real data applications there are many pre-trained CN network libraries that can be downloaded and used for several different tasks; an example for image recognition is the AlexNet of Krizhevsky et al. [226].

## **9.1 Plain-Vanilla Convolutional Neural Network Layer**

Structurally, CN network architectures are similar to FN network architectures, except that certain FN layers are replaced by CN layers. Therefore, we start by introducing the CN layer, and one should keep the structure of the FN layer (7.5) in mind. In a nutshell, FN layers consider non-linearly activated inner products $\langle \boldsymbol{w}_j^{(m)}, \boldsymbol{z} \rangle$, and CN layers replace these inner products by a type of convolution $W_j^{(m)} * \boldsymbol{z}$.

## *9.1.1 Input Tensors and Channels*

We start from an *input tensor* $\boldsymbol{z} \in \mathbb{R}^{q^{(1)} \times \cdots \times q^{(K)}}$ that has dimension $q^{(1)} \times \cdots \times q^{(K)}$. This input tensor $\boldsymbol{z}$ is a *multi-dimensional array of order (length)* $K \in \mathbb{N}$ with elements $z_{i_1,\ldots,i_K} \in \mathbb{R}$ for $1 \le i_k \le q^{(k)}$ and $1 \le k \le K$. The special case of order $K = 2$ is a matrix $\boldsymbol{z} \in \mathbb{R}^{q^{(1)} \times q^{(2)}}$. This matrix can illustrate a black and white image of dimension $q^{(1)} \times q^{(2)}$, with the matrix entries $z_{i_1,i_2} \in \mathbb{R}$ describing the intensities of the gray scale in the corresponding pixels $(i_1, i_2)$. A color image typically has the three color channels Red, Green and Blue (RGB), and such an RGB image can be represented by a tensor $\boldsymbol{z} \in \mathbb{R}^{q^{(1)} \times q^{(2)} \times q^{(3)}}$ of order 3, with $q^{(1)} \times q^{(2)}$ being the dimension of the image and $q^{(3)} = 3$ describing the three color channels, i.e., $(z_{i_1,i_2,1}, z_{i_1,i_2,2}, z_{i_1,i_2,3})^\top \in \mathbb{R}^3$ describes the intensities of the colors RGB in the pixel $(i_1, i_2)$.

Typically, the structure of black and white images and RGB images is unified by representing the black and white picture by a tensor $\boldsymbol{z} \in \mathbb{R}^{q^{(1)} \times q^{(2)} \times q^{(3)}}$ of order 3 with a single channel $q^{(3)} = 1$. This philosophy is going to be used throughout this chapter. Namely, if we consider a tensor $\boldsymbol{z} \in \mathbb{R}^{q^{(1)} \times \cdots \times q^{(K-1)} \times q^{(K)}}$ of order $K$, the first $K-1$ components $(i_1,\ldots,i_{K-1})$ will play the role of the *spatial components* that have a natural topology, and the last components $1 \le i_K \le q^{(K)}$ are called the *channels*, reflecting, e.g., a gray scale (for $q^{(K)} = 1$) or the RGB intensities (for $q^{(K)} = 3$).

In Sect. 9.1.3, below, we will also study time-series data where we have 2nd order tensors (matrices). The first component reflects time $1 \le t \le q^{(1)}$, i.e., the spatial component is temporal for time-series data, and the second component (channels) describes the different elements $\boldsymbol{z}_t = (z_{t,1},\ldots,z_{t,q^{(2)}})^\top \in \mathbb{R}^{q^{(2)}}$ that are measured/observed at each time point $t$.

## *9.1.2 Generic Convolutional Neural Network Layer*

We start from an input tensor $\boldsymbol{z} \in \mathbb{R}^{q_{m-1}^{(1)} \times \cdots \times q_{m-1}^{(K)}}$ of order $K$. The first $K-1$ components of this tensor have a spatial structure and the $K$-th component stands for the channels. A CN layer applies (local) convolution operations to this tensor. We choose a *filter size*, also called *window size* or *kernel size*, $(f_m^{(1)},\ldots,f_m^{(K)})^\top \in \mathbb{N}^K$ with $f_m^{(k)} \le q_{m-1}^{(k)}$, for $1 \le k \le K-1$, and $f_m^{(K)} = q_{m-1}^{(K)}$. This filter size determines the output dimension of the CN operation by

$$q\_m^{(k)} \stackrel{\text{def.}}{=} q\_{m-1}^{(k)} - f\_m^{(k)} + 1,\tag{9.1}$$

for $1 \le k \le K$. Thus, the spatial size of the image is reduced by the window size of the filter minus one in each direction. In particular, the output dimension of the channels component $k = K$ is $q_m^{(K)} = 1$, i.e., all channels are compressed to a scalar output. The spatial components $1 \le k \le K-1$ retain their spatial structure, but the dimension is reduced according to (9.1).
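The output-dimension rule (9.1) is easy to check numerically. The following minimal sketch (the function name `cn_output_shape` is ours, not from the text) computes it for the $100 \times 50 \times 1$ gray-scale example used later in Sect. 9.2.1:

```python
# Sketch of the output-dimension rule (9.1); the function name is illustrative.
def cn_output_shape(input_shape, filter_shape):
    """q_m^(k) = q_{m-1}^(k) - f_m^(k) + 1 for each component k."""
    return tuple(q - f + 1 for q, f in zip(input_shape, filter_shape))

# A 100 x 50 x 1 gray-scale image with a 9 x 9 x 1 filter: the channel
# dimension is compressed to 1, the spatial dimensions shrink by f - 1.
print(cn_output_shape((100, 50, 1), (9, 9, 1)))  # (92, 42, 1)
```

The same rule reproduces the first encoding step of Listing 9.1, where a $180 \times 50 \times 3$ tensor with an $11 \times 6 \times 3$ filter gives $170 \times 45 \times 1$ per filter.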

A *CN operation* is a mapping (note that the order of the tensor is reduced from *K* to *K* − 1 because the channels are compressed; index *j* is going to be explained later)

$$\boldsymbol{z}_{j}^{(m)}: \mathbb{R}^{q_{m-1}^{(1)} \times \cdots \times q_{m-1}^{(K)}} \to \mathbb{R}^{q_{m}^{(1)} \times \cdots \times q_{m}^{(K-1)}} \tag{9.2}$$

$$\boldsymbol{z} \mapsto \boldsymbol{z}_{j}^{(m)}(\boldsymbol{z}) = \left( z_{i_1,\ldots,i_{K-1};j}^{(m)}(\boldsymbol{z}) \right)_{1 \le i_k \le q_m^{(k)};\; 1 \le k \le K-1},$$

taking the values, for a fixed activation function $\phi: \mathbb{R} \to \mathbb{R}$,

$$z_{i_1,\ldots,i_{K-1};j}^{(m)}(\boldsymbol{z}) = \phi\left(w_{0,j}^{(m)} + \sum_{l_1=1}^{f_m^{(1)}} \cdots \sum_{l_K=1}^{f_m^{(K)}} w_{l_1,\ldots,l_K;j}^{(m)}\, z_{i_1+l_1-1,\ldots,i_{K-1}+l_{K-1}-1,\,l_K}\right),\tag{9.3}$$

for given intercept $w_{0,j}^{(m)} \in \mathbb{R}$ and *filter weights*

$$\mathbf{W}\_j^{(m)} = \left( w\_{l\_1, \ldots, l\_K; j}^{(m)} \right)\_{1 \le l\_k \le f\_m^{(k)}; 1 \le k \le K} \in \mathbb{R}^{f\_m^{(1)} \times \cdots \times f\_m^{(K)}};\tag{9.4}$$

the network parameter has dimension $r_m = 1 + \prod_{k=1}^{K} f_m^{(k)}$.

At first sight this CN operation looks quite complicated. Let us give some remarks that allow for a better understanding and a more compact notation. The operation in (9.3) chooses the corner $(i_1,\ldots,i_{K-1},1)$ as base point, and then it reads the tensor elements in the (discrete) window

$$(i\_1, \ldots, i\_{K-1}, 1) + \left[0: f\_m^{(1)} - 1\right] \times \cdots \times \left[0: f\_m^{(K-1)} - 1\right] \times \left[0: f\_m^{(K)} - 1\right],\tag{9.5}$$

with given filter weights $W_j^{(m)}$. This window is then moved across the entire tensor $\boldsymbol{z}$ by changing the base point $(i_1,\ldots,i_{K-1},1)$ accordingly, but with fixed filter weights $W_j^{(m)}$. This operation resembles a convolution; however, in (9.3) the indices in $z_{i_1+l_1-1,\ldots,i_{K-1}+l_{K-1}-1,\,l_K}$ run in the reverse direction compared to a classical (mathematical) convolution. By a slight abuse of notation, we nevertheless use the symbol of the convolution operator $*$ to abbreviate (9.2). This gives us the compact notation:

$$\mathbf{z}\_{j}^{(m)}: \mathbb{R}^{q\_{m-1}^{(1)} \times \cdots \times q\_{m-1}^{(K)}} \to \mathbb{R}^{q\_{m}^{(1)} \times \cdots \times q\_{m}^{(K-1)}}$$

$$\mathbf{z} \mapsto \mathbf{z}\_{j}^{(m)}(\mathbf{z}) = \phi \left( w\_{0,j}^{(m)} + W\_{j}^{(m)} \ast \mathbf{z} \right), \qquad (9.6)$$

having the activations, for $1 \le i_k \le q_m^{(k)}$, $1 \le k \le K-1$,

$$
\phi \left( w\_{0,j}^{(m)} + W\_j^{(m)} \ast z \right)\_{i\_1, \dots, i\_{K-1}} = z\_{i\_1, \dots, i\_{K-1}; j}^{(m)}(z),
$$

where the latter is given by (9.3).
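As a concrete illustration of (9.3) and (9.6), the following sketch implements one single filter with NumPy, using a naive loop over all base points; this is not an efficient implementation, and the function name `cn_filter` is ours:

```python
import numpy as np

def cn_filter(z, W, w0, phi=np.tanh):
    # One CN filter (9.6): slide the window across the spatial components of
    # the tensor z and compress all channels, cf. (9.3).
    out_shape = tuple(q - f + 1 for q, f in zip(z.shape, W.shape))[:-1]
    out = np.empty(out_shape)
    for idx in np.ndindex(out_shape):
        base = idx + (0,)  # base point (i_1, ..., i_{K-1}, 1), 0-based here
        window = z[tuple(slice(i, i + f) for i, f in zip(base, W.shape))]
        out[idx] = phi(w0 + np.sum(W * window))
    return out

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 6, 3))   # order-3 input tensor with 3 channels
W = rng.normal(size=(2, 2, 3))   # filter weights with f^(3) = q^(3) = 3
out = cn_filter(z, W, w0=0.1)
print(out.shape)  # (5, 5): spatial structure kept, channels compressed
```

Note that, unlike a mathematical convolution, the window indices run forward from the base point, exactly as in (9.3).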

#### *Remarks 9.1*


This understanding now allows us to define a CN layer. Note that the mappings (9.6) have a lower index $j$ which indicates that this is one single projection (filter extraction), called a *filter*. By choosing multiple different filters $(w_{0,j}^{(m)}, W_j^{(m)})$, we can define the CN layer as follows.

Choose $q_m^{(K)} \in \mathbb{N}$ filters, each having an $r_m$-dimensional filter weight $(w_{0,j}^{(m)}, W_j^{(m)})$, $1 \le j \le q_m^{(K)}$. A *CN layer* is a mapping

$$\boldsymbol{z}^{(m)}: \mathbb{R}^{q_{m-1}^{(1)} \times \cdots \times q_{m-1}^{(K)}} \to \mathbb{R}^{q_m^{(1)} \times \cdots \times q_m^{(K)}} \tag{9.7}$$

$$\boldsymbol{z} \mapsto \boldsymbol{z}^{(m)}(\boldsymbol{z}) = \left( \boldsymbol{z}_1^{(m)}(\boldsymbol{z}), \ldots, \boldsymbol{z}_{q_m^{(K)}}^{(m)}(\boldsymbol{z}) \right),$$


with filters $\boldsymbol{z}_j^{(m)}(\boldsymbol{z}) \in \mathbb{R}^{q_m^{(1)} \times \cdots \times q_m^{(K-1)}}$, $1 \le j \le q_m^{(K)}$, given by (9.6).

A CN layer (9.7) converts the $q_{m-1}^{(K)}$ input channels into $q_m^{(K)}$ output filters, preserving the spatial structure on the first $K-1$ components of the input tensor $\boldsymbol{z}$. From a mathematical perspective, CN layers and networks have been studied, among others, by Zhang et al. [403, 404], Mallat [263] and Wiatowski–Bölcskei [382]. These authors prove that CN networks have certain translation invariance properties and deformation stability. This explains why these networks allow one to recognize similar objects at different locations in the input tensor. Basically, by translating the filter windows (9.5) across the tensor, we try to extract the local structure from the tensor that provides similar signals in different locations of that tensor. Thinking of an image where we try to recognize, say, a dog, such a dog can be located at different sites in the image, and a filter (window) that moves across that image tries to locate the dog.

A CN layer (9.7) defines one layer, indexed by the upper index $(m)$, and for deep representation learning we now have to compose multiple of these CN layers; we can also compose CN layers with FN layers or RN layers. Before doing so, we need to introduce some special purpose layers and tools that are useful for CN network modeling; this is done in Sect. 9.2, below.

## *9.1.3 Example: Time-Series Analysis and Image Recognition*

Most CN network examples are based on time-series data or images. The former has a 1-dimensional temporal component, and the latter has a 2-dimensional spatial component. Thus, these two examples give us tensors of orders $K = 2$ and $K = 3$, respectively. We briefly discuss such examples as specific applications of tensors of a general order $K \ge 2$.

#### **Time-Series Analysis with CN Networks**

For a time-series analysis we often have observations $\boldsymbol{x}_t \in \mathbb{R}^{q_0}$ for the time points $0 \le t \le T$. Bringing this time-series data into a tensor form gives us

$$\mathbf{x} = \mathbf{x}\_{0:T}^{\top} = (\mathbf{x}\_0, \dots, \mathbf{x}\_T)^{\top} \in \mathbb{R}^{(T+1)\times q\_0} = \mathbb{R}^{q\_0^{(1)}\times q\_0^{(2)}},$$

with $q_0^{(1)} = T + 1$ and $q_0^{(2)} = q_0$. We have met such examples in Chap. 8 on RN networks. Thus, for time-series data the input to a CN network is a tensor of order $K = 2$ with a temporal component having the dimension $T+1$, and at each time point $t$ we have $q_0$ measurements (channels) $\boldsymbol{x}_t \in \mathbb{R}^{q_0}$. A CN network tries to find similar structure at different time points in this time-series data $\boldsymbol{x}_{0:T}$. For a first CN layer $m = 1$ we therefore choose $q_1 \in \mathbb{N}$ filters and consider the mapping

$$\mathbf{z}^{(1)}: \mathbb{R}^{(T+1)\times q\_0} \to \mathbb{R}^{(T-f\_1+2)\times q\_1} \tag{9.8}$$

$$\mathbf{x}\_{0:T}^{\top} \mapsto \mathbf{z}^{(1)}(\mathbf{x}\_{0:T}^{\top}) = \left(\mathbf{z}\_1^{(1)}(\mathbf{x}\_{0:T}^{\top}), \dots, \mathbf{z}\_{q\_1}^{(1)}(\mathbf{x}\_{0:T}^{\top})\right),$$

with filters $\boldsymbol{z}_j^{(1)}(\boldsymbol{x}_{0:T}^\top) \in \mathbb{R}^{T - f_1 + 2}$, $1 \le j \le q_1$, given by (9.6) for a fixed window size $f_1 \in \mathbb{N}$. From (9.8) we observe that the length of the time-series is reduced from $T+1$ to $T - f_1 + 2$, accounting for the window size $f_1$. In financial mathematics, a structure like (9.8) is often called a rolling window that moves across the time-series $\boldsymbol{x}_{0:T}$ and extracts the corresponding information.
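The rolling-window reading of (9.8) can be sketched as follows; this is a single filter with a linear activation, and all names and values are illustrative only:

```python
import numpy as np

def rolling_filter(x, w, w0=0.0, phi=lambda u: u):
    # One CN filter on a time series x of shape (T+1, q_0): a window of
    # length f_1 slides over time, producing T - f_1 + 2 outputs, cf. (9.8).
    f = w.shape[0]
    return np.array([phi(w0 + np.sum(w * x[t:t + f]))
                     for t in range(x.shape[0] - f + 1)])

x = np.arange(10.0).reshape(10, 1)  # T + 1 = 10 time points, q_0 = 1 channel
w = np.ones((3, 1)) / 3.0           # window size f_1 = 3, a moving average
out = rolling_filter(x, w)
print(out.shape)  # (8,), i.e., T - f_1 + 2 = 8
```

With these particular weights the filter is simply a length-3 moving average of the series, which illustrates the rolling-window interpretation.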

We have introduced two different architectures to process time-series information $\boldsymbol{x}_{0:T}$, and these different architectures serve different purposes. An RN network architecture is most suitable if we try to forecast the next response of a time-series, i.e., we typically process the past observations through a recurrent structure to predict the next response; this is the motivation, e.g., behind Figs. 8.4 and 8.5. The motivation for the use of a CN network architecture is different, as we try to find similar structure at different times; e.g., in a financial time-series we may be interested in finding the downturns of more than 20%. The latter is a local analysis which is explored by local filters (of a finite window size).

#### **Image Recognition**

Image recognition extends (9.8) by one order to a tensor of order $K = 3$. Typically, we have images of dimension (pixels) $I \times J$ with three color channels RGB. These images then read as

$$\boldsymbol{x} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3) \in \mathbb{R}^{I \times J \times 3} = \mathbb{R}^{q_0^{(1)} \times q_0^{(2)} \times q_0^{(3)}},$$

where $\boldsymbol{x}_1 \in \mathbb{R}^{I \times J}$ is the intensity of red, $\boldsymbol{x}_2 \in \mathbb{R}^{I \times J}$ is the intensity of green, and $\boldsymbol{x}_3 \in \mathbb{R}^{I \times J}$ is the intensity of blue.

Choose a window size of $f_1^{(1)} \times f_1^{(2)}$ and $q_1 \in \mathbb{N}$ filters to receive the CN layer

$$\boldsymbol{z}^{(1)}: \mathbb{R}^{I \times J \times 3} \to \mathbb{R}^{(I - f_1^{(1)} + 1) \times (J - f_1^{(2)} + 1) \times q_1} \tag{9.9}$$

$$(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3) \mapsto \boldsymbol{z}^{(1)}(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3) = \left( \boldsymbol{z}_1^{(1)}(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3), \ldots, \boldsymbol{z}_{q_1}^{(1)}(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3) \right),$$

with filters $\boldsymbol{z}_j^{(1)}(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3) \in \mathbb{R}^{(I - f_1^{(1)} + 1) \times (J - f_1^{(2)} + 1)}$, $1 \le j \le q_1$. Thus, we compress the 3 channels in each filter $j$, but we preserve the spatial structure of the image (by the convolution operation $*$).

For black and white pictures, which only have one color channel, we preserve the spatial structure of the picture, and we modify the input tensor to a tensor of order 3 and of the form

$$\boldsymbol{x} = (\boldsymbol{x}_1) \in \mathbb{R}^{I \times J \times 1}.$$

## **9.2 Special Purpose Tools for Convolutional Neural Networks**

## *9.2.1 Padding with Zeros*

We have seen that the CN operation reduces the size of the output by the filter sizes, see (9.1). Thus, if we start from an image of size $100 \times 50 \times 1$, and if the filter sizes are given by $f_m^{(1)} = f_m^{(2)} = 9$, then the output will be of dimension $92 \times 42 \times q_1^{(3)}$, see (9.9). Sometimes, this reduction in dimension is impractical, and padding helps to keep the original shape. Padding a tensor $\boldsymbol{z}$ with $p_m^{(k)}$ parameters, $1 \le k \le K-1$, means that the tensor is extended in all $K-1$ spatial directions by (typically) adding zeros of that size, so that the padded tensor has dimension

$$\left(p\_m^{(1)} + q\_{m-1}^{(1)} + p\_m^{(1)}\right) \times \dots \times \left(p\_m^{(K-1)} + q\_{m-1}^{(K-1)} + p\_m^{(K-1)}\right) \times q\_{m-1}^{(K)}.$$

This implies that the output filters will have the dimensions

$$q\_m^{(k)} = q\_{m-1}^{(k)} + 2p\_m^{(k)} - f\_m^{(k)} + 1,$$

for $1 \le k \le K-1$. The spatial dimension of the original tensor is preserved if $2p_m^{(k)} - f_m^{(k)} + 1 = 0$. Padding does not add any additional parameters; it is only used to reshape the tensors.
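A quick numerical check of the padded output dimension (the helper name is ours):

```python
def padded_output_dim(q, f, p):
    # Spatial output dimension with zero padding p on both sides:
    # q_m = q_{m-1} + 2 p - f + 1.
    return q + 2 * p - f + 1

# For an odd filter size f = 9, the choice p = (f - 1) / 2 = 4 satisfies
# 2p - f + 1 = 0 and thus preserves the spatial dimension:
print(padded_output_dim(100, 9, 4))  # 100
print(padded_output_dim(50, 9, 4))   # 50
```

This recovers the $100 \times 50$ example from above: with padding $p_m^{(k)} = 4$ the spatial dimensions stay at $100 \times 50$ instead of shrinking to $92 \times 42$.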

## *9.2.2 Stride*

Strides are used to skip part of the input tensor $\boldsymbol{z}$ in order to reduce the size of the output. This may be useful if the input tensor is a very high resolution image. Choose the stride parameters $s_m^{(k)}$, $1 \le k \le K-1$. We can then replace the summation in (9.3) by the following term

$$\sum_{l_1=1}^{f_m^{(1)}} \cdots \sum_{l_K=1}^{f_m^{(K)}} w_{l_1,\ldots,l_K;j}^{(m)}\, z_{s_m^{(1)}(i_1-1)+l_1,\ldots,s_m^{(K-1)}(i_{K-1}-1)+l_{K-1},\,l_K}.$$

This only extracts the tensor entries on a discrete grid of the tensor by translating the window by multiples of integers, see also (9.5),

$$\left(s_m^{(1)}(i_1 - 1), \ldots, s_m^{(K-1)}(i_{K-1} - 1), 1\right) + \left[1 : f_m^{(1)}\right] \times \cdots \times \left[1 : f_m^{(K-1)}\right] \times \left[0 : f_m^{(K)} - 1\right],$$

and the size of the output is reduced correspondingly. If we choose strides $s_m^{(k)} = f_m^{(k)}$, $1 \le k \le K-1$, we receive a partition of the spatial part of the input tensor $\boldsymbol{z}$; this is going to be used in the max-pooling layer (9.11).

## *9.2.3 Dilation*

Dilation is similar to stride, though different in that it enlarges the filter sizes instead of skipping certain positions in the input tensor. Choose the dilation parameters $e_m^{(k)}$, $1 \le k \le K-1$. We can then replace the summation in (9.3) by the following term

$$\sum_{l_1=1}^{f_m^{(1)}} \cdots \sum_{l_K=1}^{f_m^{(K)}} w_{l_1,\ldots,l_K;j}^{(m)}\, z_{i_1+e_m^{(1)}(l_1-1),\ldots,i_{K-1}+e_m^{(K-1)}(l_{K-1}-1),\,l_K}.$$

This applies the filter weights to the tensor entries on discrete grids

$$\left(i_1,\ldots,i_{K-1},1\right) + e_m^{(1)}\left[0 : f_m^{(1)}-1\right] \times \cdots \times e_m^{(K-1)}\left[0 : f_m^{(K-1)}-1\right] \times \left[0 : f_m^{(K)}-1\right],$$

where the intervals $e_m^{(k)}\left[0 : f_m^{(k)}-1\right]$ run over grids of span sizes $e_m^{(k)}$, $1 \le k \le K-1$. Thus, in comparably smooth images we do not read all the pixels in the window but only every $e_m^{(k)}$-th pixel. This also reduces the size of the output tensor.
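The effect of strides and dilation on the output dimension can be sketched as follows; these are the standard formulas under the no-padding convention, and the function names are ours:

```python
def strided_output_dim(q, f, s):
    # The base point moves in steps of s, so a window of size f fits
    # floor((q - f) / s) + 1 times into a spatial dimension of size q.
    return (q - f) // s + 1

def dilated_output_dim(q, f, e):
    # A dilated filter covers an effective window of size e * (f - 1) + 1,
    # reducing the spatial dimension to q - e * (f - 1).
    return q - e * (f - 1)

print(strided_output_dim(100, 10, 10))  # 10: stride s = f gives a partition
print(dilated_output_dim(100, 3, 4))    # 92: same as an undilated 9-window
```

The first example illustrates the partition obtained for $s_m^{(k)} = f_m^{(k)}$, which is the setting of the max-pooling layer below; the second shows that a size-3 filter with dilation 4 reads the same span as an undilated size-9 filter, but with far fewer weights.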

## *9.2.4 Pooling Layer*

As we have seen above, the dimension of the tensor is reduced by the filter size in each spatial direction if we do not apply padding with zeros. In general, deep representation learning follows the paradigm of auto-encoding by reducing a high-dimensional input to a low-dimensional representation. In CN networks this is usually (efficiently) done by so-called pooling layers. In spirit, pooling layers work similarly to CN layers (having a fixed window size), but instead of a convolution operation $*$ we apply a maximum operation to the window to extract the dominant tensor elements.

We choose a fixed window size $(f_m^{(1)},\ldots,f_m^{(K-1)})^\top \in \mathbb{N}^{K-1}$ and strides $s_m^{(k)} = f_m^{(k)}$, $1 \le k \le K-1$, for the spatial components of the tensor $\boldsymbol{z}$ of order $K$. A *max-pooling layer* is given by

$$\mathbf{z}^{(m)}: \mathbb{R}^{q\_{m-1}^{(1)} \times \cdots \times q\_{m-1}^{(K)}} \to \mathbb{R}^{q\_m^{(1)} \times \cdots \times q\_m^{(K)}}$$

$$\mathbf{z} \mapsto \mathbf{z}^{(m)}(\mathbf{z}) = \text{MaxPool}(\mathbf{z}), \tag{9.10}$$

with dimensions $q_m^{(K)} = q_{m-1}^{(K)}$ and, for $1 \le k \le K-1$,

$$q\_{m}^{(k)} = \left\lfloor q\_{m-1}^{(k)} / f\_{m}^{(k)} \right\rfloor,\tag{9.11}$$

having the activations, for $1 \le i_k \le q_m^{(k)}$, $1 \le k \le K$,

$$\text{MaxPool}(\boldsymbol{z})_{i_1,\ldots,i_K} = \max_{\substack{1 \le l_k \le f_m^{(k)} \\ 1 \le k \le K-1}} z_{f_m^{(1)}(i_1-1)+l_1,\ldots,f_m^{(K-1)}(i_{K-1}-1)+l_{K-1},\,i_K}.$$

Alternatively, the floors in (9.11) could be replaced by ceilings and padding with zeros to receive the right cardinality. This extracts the maximums from the (spatial) windows

$$\begin{aligned} &\left(f_m^{(1)}(i_1-1),\ldots,f_m^{(K-1)}(i_{K-1}-1),i_K\right)+\left[1 : f_m^{(1)}\right]\times\cdots\times\left[1 : f_m^{(K-1)}\right]\times[0] \\ &=\left[f_m^{(1)}(i_1-1)+1 : f_m^{(1)}i_1\right]\times\cdots\times\left[f_m^{(K-1)}(i_{K-1}-1)+1 : f_m^{(K-1)}i_{K-1}\right]\times[i_K],\end{aligned}$$

for each channel $1 \le i_K \le q_{m-1}^{(K)}$ individually. Thus, the max-pooling operator is chosen such that it extracts the maximum of each channel and each window, the windows providing a partition of the spatial part of the tensor. This reduces the dimension of the tensor according to (9.11): e.g., if we consider a tensor of order 3 of an RGB image of dimension $I \times J = 180 \times 50$ and apply a max-pooling layer with window sizes $f_m^{(1)} = 10$ and $f_m^{(2)} = 5$, we receive the dimension reduction

$$180 \times 50 \times 3 \;\mapsto\; 18 \times 10 \times 3.$$

Replacing the maximum operator in (9.10) by an averaging operator is sometimes also used, and this is called an *average-pooling layer*.
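The max-pooling operation (9.10) on an order-3 tensor can be sketched with NumPy as follows; this uses naive loops, and the function name `max_pool` is ours:

```python
import numpy as np

def max_pool(z, f1, f2):
    # Max-pooling on a tensor of shape (q1, q2, channels): the maximum is
    # taken over disjoint f1 x f2 windows, channel by channel, cf. (9.10).
    q1, q2, channels = z.shape
    out = np.empty((q1 // f1, q2 // f2, channels))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = z[i * f1:(i + 1) * f1,
                          j * f2:(j + 1) * f2].max(axis=(0, 1))
    return out

# The 180 x 50 x 3 RGB example with window sizes 10 and 5:
z = np.random.default_rng(0).normal(size=(180, 50, 3))
pooled = max_pool(z, 10, 5)
print(pooled.shape)  # (18, 10, 3)

z_small = np.arange(8.0).reshape(2, 2, 2)  # tiny deterministic check
print(max_pool(z_small, 2, 2)[0, 0, 0])    # 6.0, the max of channel 0
```

An average-pooling layer is obtained by replacing `.max(axis=(0, 1))` with `.mean(axis=(0, 1))`.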

## *9.2.5 Flatten Layer*

A *flatten layer* performs the transformation of rearranging a tensor to a vector, so that the output of a flatten layer can be used as an input to a FN layer. That is,

$$\begin{array}{c} \mathbf{z}^{(m)}: \mathbb{R}^{q\_{m-1}^{(1)} \times \cdots \times q\_{m-1}^{(K)}} \to \mathbb{R}^{q\_m} \\\\ \mathbf{z} \mapsto \mathbf{z}^{(m)}(\mathbf{z}) = \left(z\_{1,\ldots,1}, \ldots, z\_{q\_{m-1}^{(1)},\ldots, q\_{m-1}^{(K)}}\right)^{\top}, \end{array} \tag{9.12}$$

with $q_m = \prod_{k=1}^{K} q_{m-1}^{(k)}$. We have already used flatten layers after embedding layers on lines 8 and 11 of Listing 7.4.

## **9.3 Convolutional Neural Network Architectures**

## *9.3.1 Illustrative Example of a CN Network Architecture*

We are now ready to patch everything together. Assume we have RGB images described by tensors $\boldsymbol{x}^{(0)} \in \mathbb{R}^{I \times J \times 3}$ of order 3, modeling the three RGB channels of images of a fixed size $I \times J$. Moreover, we have the tabular feature information $\boldsymbol{x}^{(1)} \in \mathcal{X} \subset \{1\} \times \mathbb{R}^q$ that describes further properties of the data. That is, we have an input variable $(\boldsymbol{x}^{(0)}, \boldsymbol{x}^{(1)})$, and we aim at predicting a response variable $Y$ by using a suitable regression function

$$(\boldsymbol{x}^{(0)}, \boldsymbol{x}^{(1)}) \mapsto \mu(\boldsymbol{x}^{(0)}, \boldsymbol{x}^{(1)}) = \mathbb{E}\left[Y \,\middle|\, \boldsymbol{x}^{(0)}, \boldsymbol{x}^{(1)}\right]. \tag{9.13}$$

We choose two convolutional layers $\boldsymbol{z}^{(\text{CN1})}$ and $\boldsymbol{z}^{(\text{CN2})}$, each followed by a max-pooling layer $\boldsymbol{z}^{(\text{Max1})}$ and $\boldsymbol{z}^{(\text{Max2})}$, respectively. Then we apply a flatten layer $\boldsymbol{z}^{(\text{flatten})}$ to bring the learned representation into a vector form. These layers are chosen according to (9.7), (9.10) and (9.12) with matching input and output dimensions, so that the following composition is well-defined

$$\boldsymbol{z}^{(5:1)} = \left( \boldsymbol{z}^{(\text{flatten})} \circ \boldsymbol{z}^{(\text{Max2})} \circ \boldsymbol{z}^{(\text{CN2})} \circ \boldsymbol{z}^{(\text{Max1})} \circ \boldsymbol{z}^{(\text{CN1})} \right) : \mathbb{R}^{I \times J \times 3} \to \mathbb{R}^{q_5}.$$

Listing 9.1 provides an example starting from an $I \times J \times 3 = 180 \times 50 \times 3$ input tensor $\boldsymbol{x}^{(0)}$ and receiving a $q_5 = 60$ dimensional learned representation $\boldsymbol{z}^{(5:1)}(\boldsymbol{x}^{(0)}) \in \mathbb{R}^{60}$.

**Listing 9.1** CN network architecture in keras

```
1 shape <- c(180,50,3)
2 #
3 model = keras_model_sequential()
4 model %>%
5 layer_conv_2d(filters = 10, kernel_size = c(11,6), activation='tanh',
6 input_shape = shape) %>%
7 layer_max_pooling_2d(pool_size = c(10,5)) %>%
8 layer_conv_2d(filters = 5, kernel_size = c(6,4), activation='tanh') %>%
9 layer_max_pooling_2d(pool_size = c(3,2)) %>%
10 layer_flatten()
```



Listing 9.2 gives the summary of this architecture providing the dimension reduction mappings (encodings)

$$180 \times 50 \times 3 \;\stackrel{\text{CN1}}{\mapsto}\; 170 \times 45 \times 10 \;\stackrel{\text{Max1}}{\mapsto}\; 17 \times 9 \times 10 \;\stackrel{\text{CN2}}{\mapsto}\; 12 \times 6 \times 5 \;\stackrel{\text{Max2}}{\mapsto}\; 4 \times 3 \times 5 \;\stackrel{\text{flatten}}{\mapsto}\; 60.$$

The first CN layer ($m = 1$) involves $q_1^{(3)} r_1 = 10 \cdot (1 + 11 \cdot 6 \cdot 3) = 1\,990$ filter weights $(w_{0,j}^{(1)}, W_j^{(1)})_{1 \le j \le q_1^{(3)}}$ (including the intercepts), and the second CN layer ($m = 3$) involves $q_3^{(3)} r_3 = 5 \cdot (1 + 6 \cdot 4 \cdot 10) = 1\,205$ filter weights $(w_{0,j}^{(3)}, W_j^{(3)})_{1 \le j \le q_3^{(3)}}$. Altogether we have a network parameter of dimension $3\,195$ to be fitted in this CN network architecture.
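These parameter counts are easy to reproduce from the filter-weight dimension $r_m$ in (9.4); a short sketch (the helper name is ours):

```python
def cn_layer_params(filters, f1, f2, in_channels):
    # Each filter has one intercept plus f1 * f2 * in_channels filter
    # weights, cf. (9.4); a CN layer stacks `filters` such filters.
    return filters * (1 + f1 * f2 * in_channels)

p1 = cn_layer_params(10, 11, 6, 3)   # first CN layer of Listing 9.1
p2 = cn_layer_params(5, 6, 4, 10)    # second CN layer of Listing 9.1
print(p1, p2, p1 + p2)  # 1990 1205 3195
```

The max-pooling and flatten layers contribute no parameters, so the two CN layers account for the entire network parameter of this architecture.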

To perform the prediction task (9.13) we concatenate the learned representation $\boldsymbol{z}^{(5:1)}(\boldsymbol{x}^{(0)}) \in \mathbb{R}^{q_5}$ of the RGB image $\boldsymbol{x}^{(0)}$ with the tabular feature $\boldsymbol{x}^{(1)} \in \mathcal{X} \subset \{1\} \times \mathbb{R}^{q}$. This concatenated vector is processed through a FN network architecture $\boldsymbol{z}^{(d+5:6)}$ of depth $d \ge 1$ providing the output

$$\left(\boldsymbol{z}^{(5:1)}(\boldsymbol{x}^{(0)}), \boldsymbol{x}^{(1)}\right) \mapsto \mathbb{E}\left[Y \left| \, \boldsymbol{x}^{(0)}, \boldsymbol{x}^{(1)}\right.\right] = g^{-1}\left\langle \boldsymbol{\beta}, \boldsymbol{z}^{(d+5:6)}\left(\boldsymbol{z}^{(5:1)}(\boldsymbol{x}^{(0)}), \boldsymbol{x}^{(1)}\right)\right\rangle,$$

for given link function *g*. This last step can be done in complete analogy to Chap. 7, and fitting of such a network architecture uses variants of the SGD algorithm.

## *9.3.2 Lab: Telematics Data*

We present a CN network example that studies time-series of telematics car driving data; unfortunately, this data is not publicly available. Recently, telematics car driving data has gained much popularity in actuarial science because it provides information on car drivers that goes beyond the classical features (age of driver, year of driving test, etc.), and it allows for a better discrimination of good and bad drivers as it is directly based on driving habits and driving styles.

The telematics data has many different aspects. Raw telematics data typically consists of high-frequency GPS location data, say, second by second, from which several different statistics such as speed, acceleration and change of direction can be calculated. Besides the GPS location data, it often contains vehicle speeds from the vehicle instrumental panel, and acceleration in all directions from an accelerometer. Thus, there are often 3 different sources from which the speed and the acceleration can be extracted. In practice, data quality is often an issue as these 3 different sources may give substantially different numbers; Meng et al. [271] give a broader discussion of these data quality issues. The telematics GPS data is often complemented by further information such as engine revolutions, daytime of trips, road and traffic conditions, weather conditions, traffic rule violations, etc. This raw telematics data is then pre-processed, e.g., special maneuvers are extracted (speeding, sudden acceleration, hard braking, extreme right- and left-turns), total distances are calculated, and driving distances at different daytimes and weekdays are analyzed. For references analyzing such statistics for predictive modeling we refer to Ayuso et al. [17–19], Boucher et al. [42], Huang–Meng [193], Lemaire et al. [246], Paefgen et al. [291], So et al. [344], Sun et al. [347] and Verbelen et al. [370]. A different approach has been taken by Wüthrich [388] and Gao et al. [151, 154, 155]: these authors aggregate the telematics speed and acceleration data to so-called speed-acceleration $v$-$a$ heatmaps. These $v$-$a$ heatmaps are understood as images which can be analyzed, e.g., by CN networks; such an analysis has been performed in Zhu–Wüthrich [407] for image classification and in Gao et al. [154] for claim frequency modeling. Finally, the work of Weidner et al. [377, 378] directly acts on the time-series of the telematics GPS data by performing a Fourier analysis.

In this section, we aim at allocating individual car driving trips to the right drivers by directly analyzing the time-series of the telematics data of these trips using CN networks. We thereby replicate the analysis of Gao–Wüthrich [156] on slightly different data. For our illustrative example we select 3 car drivers, called driver A, driver B and driver C. For each of these 3 drivers we choose individual car driving trips of 180 seconds, and we analyze their speed-acceleration-change in angle ($v$-$a$-$\Delta$) pattern every second. Thus, for $t = 1, \ldots, T = 180$, we study the three input channels

$$\boldsymbol{x}_{s,t} = \left(v_{s,t}, a_{s,t}, \Delta_{s,t}\right)^{\top} \in [2, 50]\,\text{km/h} \times [-3, 3]\,\text{m/s}^2 \times [0, 1/2] \subset \mathbb{R}^{3},$$

where $1 \le s \le S$ labels all individual trips of the considered drivers. This data has been pre-processed by cutting out the idling phase and the speeds above 50 km/h, and concatenating the remaining pieces. We perform this pre-processing since we do not want to identify the drivers because they have a special idling phase picture or because they are more likely on the highway. Acceleration has been censored at $\pm 3\,\text{m/s}^2$ because we cannot exclude that more extreme observations are caused by data quality issues (note that the acceleration is calculated from the GPS coordinates, and imprecise signals can lead to extreme acceleration observations). Finally, change in angle is measured in absolute values of sine per second (censored at $1/2$), i.e., we do not distinguish between left and right turns. This then provides us with three time-series channels giving tensors of order 2

$$\boldsymbol{x}_{s} = \left( (v_{s,1}, a_{s,1}, \Delta_{s,1})^{\top}, \dots, (v_{s,180}, a_{s,180}, \Delta_{s,180})^{\top} \right)^{\top} \in \mathbb{R}^{180 \times 3},$$

for $1 \le s \le S$. Moreover, there is a categorical response $Y_s \in \{\text{A}, \text{B}, \text{C}\}$ indicating which driver has been driving trip $s$.
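The pre-processing just described can be sketched in a few lines. The following Python snippet is ours (the book works in R) and treats speeds below 2 km/h as idling, an assumption consistent with the channel domain $[2, 50]$ km/h above.

```python
import math

def preprocess(trip):
    """Censor one telematics trip as described in the text (illustrative sketch).

    trip: list of (speed_kmh, accel_ms2, angle_change_rad) triples per second.
    """
    out = []
    for v, a, phi in trip:
        if v < 2 or v > 50:                    # cut out idling phase and speeds above 50 km/h
            continue
        a = max(-3.0, min(3.0, a))             # censor acceleration at +-3 m/s^2
        d = min(abs(math.sin(phi)), 0.5)       # absolute sine of angle change, censored at 1/2
        out.append((v, a, d))
    return out

# toy trip: idling second, hard acceleration, highway speed, sharp turn
trip = [(0.5, 0.2, 0.0), (30.0, 4.2, 0.1), (60.0, 0.0, 0.0), (45.0, -1.0, 2.0)]
cleaned = preprocess(trip)
print(cleaned)
```

In the actual application the remaining censored pieces are then concatenated to trips of $T = 180$ seconds.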

Figure 9.1 illustrates the first three trips $\boldsymbol{x}_s$ of $T = 180$ seconds of each of these three drivers A (top), B (middle) and C (bottom); note that the 180 seconds have been chosen at a random location within each trip. The first lines in red color show the acceleration patterns $(a_t)_{1 \le t \le T}$, the second lines in black color the change in angle patterns $(\Delta_t)_{1 \le t \le T}$, and the last lines in blue color the speed patterns $(v_t)_{1 \le t \le T}$.

Table 9.1 summarizes the available data. In total we have 932 individual trips, and we randomly split these trips into the learning data $\mathcal{L}$ consisting of 744 trips and the test data $\mathcal{T}$ collecting the remaining trips. The goal is to train a classification model that correctly allocates the test data $\mathcal{T}$ to the right driver. As feature information, we use the telematics data $\boldsymbol{x}_s$ of length 180 seconds. We design a logistic categorical regression model with response set $\mathcal{Y} = \{\text{A}, \text{B}, \text{C}\}$. Hence, we obtain a vector-valued parameter EF with a response having 3 levels, see Sect. 2.1.4.

To process the telematics data $\boldsymbol{x}_s$, we design a CN network architecture having three convolutional layers $\boldsymbol{z}^{(\text{CN}j)}$, $1 \le j \le 3$, each followed by a max-pooling layer $\boldsymbol{z}^{(\text{Max}j)}$; then we apply a drop-out layer $\boldsymbol{z}^{(\text{DO})}$ and finally a fully-connected FN layer $\boldsymbol{z}^{(\text{FN})}$ providing the logistic response classification; this is the same network architecture as used in Gao–Wüthrich [156]. The code is given in Listing 9.3 and it describes the mapping

$$\begin{split} \boldsymbol{z}^{(8:1)} &= \left( \boldsymbol{z}^{(\text{FN})} \circ \boldsymbol{z}^{(\text{DO})} \circ \boldsymbol{z}^{(\text{Max3})} \circ \boldsymbol{z}^{(\text{CN3})} \circ \boldsymbol{z}^{(\text{Max2})} \circ \boldsymbol{z}^{(\text{CN2})} \circ \boldsymbol{z}^{(\text{Max1})} \circ \boldsymbol{z}^{(\text{CN1})} \right) : \\ & \qquad \mathbb{R}^{T \times 3} \to (0, 1)^3. \end{split}$$

The first CN and pooling layer $\boldsymbol{z}^{(\text{Max1})} \circ \boldsymbol{z}^{(\text{CN1})}$ maps the dimension $180 \times 3$ to a tensor of dimension $58 \times 12$ using 12 filters; the max-pooling uses the floor (9.11). The second CN and pooling layer $\boldsymbol{z}^{(\text{Max2})} \circ \boldsymbol{z}^{(\text{CN2})}$ maps to $18 \times 10$ using 10 filters, and the third CN and pooling layer $\boldsymbol{z}^{(\text{Max3})} \circ \boldsymbol{z}^{(\text{CN3})}$ maps to $1 \times 8$ using 8 filters. Actually, this last max-pooling layer is a global max-pooling layer extracting the maximum in each of the 8 filters. Next, we apply a drop-out layer with a drop-out

**Fig. 9.1** First 3 trips of driver A (top), driver B (middle) and driver C (bottom); each trip is 180 seconds, red color shows the acceleration pattern $(a_t)_t$, black color the change in angle pattern $(\Delta_t)_t$ and blue color the speed pattern $(v_t)_t$

**Table 9.1** Summary of the trips and the choice of learning and test data sets *L* and *T*


rate of 30% to prevent over-fitting. Finally, we apply a fully-connected FN layer that maps the 8 neurons to the 3 categorical outputs using the softmax output activation function, which provides the canonical link of the logistic categorical EF.

**Listing 9.3** CN network architecture for the individual car trip allocation

```
shape <- c(180,3)
#
model = keras_model_sequential()
model %>%
  layer_conv_1d(filters = 12, kernel_size = 5, activation='tanh',
                input_shape = shape) %>%
  layer_max_pooling_1d(pool_size = 3) %>%
  layer_conv_1d(filters = 10, kernel_size = 5, activation='tanh') %>%
  layer_max_pooling_1d(pool_size = 3) %>%
  layer_conv_1d(filters = 8, kernel_size = 5, activation='tanh') %>%
  layer_global_max_pooling_1d() %>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 3, activation = 'softmax')
```
For a summary of the network architecture see Listing 9.4. Altogether this involves 1'237 network parameters that need to be fitted.
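The parameter count can be verified by hand; this Python sketch is ours (the book's code is in R) and uses the same counting convention as before: one intercept per filter or unit. Pooling, global max-pooling and drop-out layers carry no parameters.

```python
def conv1d_params(n_filters, kernel, in_channels):
    # per filter: one intercept plus kernel x channels weights
    return n_filters * (1 + kernel * in_channels)

def dense_params(units, inputs):
    # per unit: one intercept plus one weight per input neuron
    return units * (1 + inputs)

total = (conv1d_params(12, 5, 3)     # 1st CN layer: 192
         + conv1d_params(10, 5, 12)  # 2nd CN layer: 610
         + conv1d_params(8, 5, 10)   # 3rd CN layer: 408
         + dense_params(3, 8))       # output FN layer: 27
print(total)  # 1237
```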


**Listing 9.4** Summary of CN network architecture for the individual car trip allocation

We choose the 744 trips of the learning data $\mathcal{L}$ to train this network for the classification task, see Table 9.1. We use the multi-class cross-entropy loss function, see (4.19), with 80% of the learning data $\mathcal{L}$ as training data $\mathcal{U}$ and the remaining 20% as validation data $\mathcal{V}$ to track over-fitting. We retrieve the network with the smallest validation loss using a callback; we refer to Listing 7.3 for an example of a callback. Since the learning data is comparably small, and to reduce randomness, we use the nagging predictor averaging over 10 different network fits (using different seeds).


These fitted networks then provide us with a mapping

$$\boldsymbol{z}^{(8:1)}: \mathbb{R}^{T \times 3} \to (0,1)^3, \qquad \boldsymbol{x} \mapsto \boldsymbol{z}^{(8:1)}(\boldsymbol{x}) = \left( z_{\text{A}}^{(8:1)}(\boldsymbol{x}), z_{\text{B}}^{(8:1)}(\boldsymbol{x}), z_{\text{C}}^{(8:1)}(\boldsymbol{x}) \right)^\top,$$

and for each trip $\boldsymbol{x}_s \in \mathbb{R}^{T \times 3}$ we receive the classification

$$
\widehat{Y}_s = \underset{y \in \{\text{A}, \text{B}, \text{C}\}}{\arg\max} \; z_{y}^{(8:1)}(\boldsymbol{x}_s).
$$

Table 9.2 shows the out-of-sample results on the test data *T* . On average more than 80% of all trips are correctly allocated; a purely random allocation would provide a success rate of 33%. This shows that this allocation problem can be solved rather successfully and, indeed, the CN network architecture is able to learn structure in the telematics trip data *x<sup>s</sup>* that allows one to discriminate car drivers. This sounds very promising. In fact, the telematics car driving data seems to be very transparent which, of course, also raises privacy issues. On the downside we should mention that from this approach we cannot really see what the network has learned and how it manages to distinguish the different trips.
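The nagging-and-argmax allocation is a simple post-processing of the softmax outputs. The following Python sketch is ours (names and numbers are hypothetical): it averages the probability vectors of several fitted networks for one trip and returns the driver with the largest averaged probability.

```python
def nagging_classify(prob_list, labels=("A", "B", "C")):
    """Average the softmax outputs of several network fits for one trip
    (nagging predictor) and pick the label with maximal averaged probability."""
    k = len(prob_list[0])
    avg = [sum(p[j] for p in prob_list) / len(prob_list) for j in range(k)]
    return labels[max(range(k), key=lambda j: avg[j])]

# three fits that disagree individually, but whose average favours driver B
fits = [(0.2, 0.5, 0.3), (0.4, 0.35, 0.25), (0.1, 0.6, 0.3)]
print(nagging_classify(fits))  # B
```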

There are several approaches that try to visualize what the network has learned in the different layers by extracting the filter activations in the CN layers; others try to invert the network, backtracking which activations and weights contribute most to a certain output, we mention, e.g., DeepLIFT of Shrikumar et al. [339]. For more analysis and references we refer to Sect. 4 of the tutorial Meier–Wüthrich [269]. We do not further discuss this and close this example.

## *9.3.3 Lab: Mortality Surface Modeling*

We revisit the mortality example of Sect. 8.4.2 where we used a LSTM architecture to process the raw mortality data for forecasting, see Fig. 8.13. We make a (small) change to that architecture by simply replacing the LSTM encoder by a CN network encoder. This approach has been promoted in the literature, e.g., by Perla et al. [301], Schnürch–Korn [330] and Wang et al. [375]. A main difference between these references is whether the mortality tensor is considered as a tensor of order 2 (reflecting time-series data) or of order 3 (reflecting the mortality surface as an image). In the present example we are going to interpret the mortality tensor as a monochrome image, and this requires that we extend (8.23) by an additional channels component

$$\begin{aligned} \boldsymbol{x}_{t-\tau:t-1} &= (\boldsymbol{x}_{t-\tau}, \dots, \boldsymbol{x}_{t-1})^\top \\ &= \left(M_{x,s}\right)_{t-\tau \le s \le t-1,\, x_0 \le x \le x_1} \in \mathbb{R}^{\tau \times (x_1 - x_0 + 1) \times 1} = \mathbb{R}^{5 \times 100 \times 1}, \end{aligned}$$

for a lookback period of $\tau = 5$. The LSTM cell encodes this tensor/matrix into a 20-dimensional vector which is then concatenated with the embeddings of the country code and the gender code (8.24). We use the same architecture here, only the LSTM part is replaced by a CN network in (8.25); the corresponding code is given on lines 14–17 of Listing 9.5.

**Listing 9.5** CN network architecture to directly process the raw mortality rates *(Mx,t)x,t*

```
1 Tensor = layer_input(shape=c(lookback,100,1), dtype='float32', name='Tensor')
2 Country = layer_input(shape=c(1), dtype='int32', name='Country')
3 Gender = layer_input(shape=c(1), dtype='int32', name='Gender')
4 Time = layer_input(shape=c(1), dtype='float32', name='Time')
5 #
6 CountryEmb = Country %>%
7 layer_embedding(input_dim=8,output_dim=1,input_length=1,name='CountryEmb') %>%
8 layer_flatten(name='Country_flat')
9 #
10 GenderEmb = Gender %>%
11 layer_embedding(input_dim=2,output_dim=1,input_length=1,name='GenderEmb') %>%
12 layer_flatten(name='Gender_flat')
13 #
14 CN = Tensor %>%
15 layer_conv_2d(filters = 10, kernel_size = c(5,5), activation = 'linear') %>%
16 layer_max_pooling_2d(pool_size = c(1,8)) %>%
17 layer_flatten()
18 #
19 Output = list(CN,CountryEmb,GenderEmb) %>% layer_concatenate() %>%
20 layer_dense(units=100, activation='linear', name='scalarproduct') %>%
21 layer_reshape(c(1,100), name = 'Output')
22 #
23 model = keras_model(inputs = list(Tensor, Country, Gender),
24 outputs = c(Output))
```
Line 15 maps the input tensor $5 \times 100 \times 1$ to a tensor $1 \times 96 \times 10$ having 10 filters, the max-pooling layer reduces this tensor to $1 \times 12 \times 10$, and the flatten layer encodes this tensor into a 120-dimensional vector. This vector is then concatenated with the embedding vectors of the country and the gender codes, and this provides us with $r = 12'570$ network parameters; thus, the LSTM architecture and the CN network architecture use roughly equally many network parameters that need to be fitted. We then use the identical partition into training, validation and test data as in Sect. 8.4.2, i.e., we use the data from 1950 to 2003 for fitting the network architecture, which is then used to forecast the calendar years 2004 to 2018. The results are presented in Table 9.3.
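The shape chain and the parameter count $r = 12'570$ can be traced layer by layer. This Python sketch is ours (the listing is in R); it assumes the dense layer on line 20 sees the 120 flattened CN features plus the two one-dimensional embeddings, and that the Time input does not enter the concatenation in Listing 9.5.

```python
def conv2d_out(h, w, kh, kw):
    # output size of a 'valid' 2d convolution
    return h - kh + 1, w - kw + 1

h, w = conv2d_out(5, 100, 5, 5)      # line 15: 5 x 100 -> 1 x 96, with 10 filters
h, w = h // 1, w // 8                # line 16: max-pooling (1, 8) -> 1 x 12
flat = h * w * 10                    # line 17: flatten -> 120-dimensional vector

conv_w = 10 * (1 + 5 * 5 * 1)        # CN layer: 260 weights
emb_w = 8 * 1 + 2 * 1                # country (8 levels) and gender (2 levels) embeddings: 10
dense_w = 100 * (1 + flat + 1 + 1)   # dense layer on (CN, country, gender): 12300
r = conv_w + emb_w + dense_w
print(flat, r)                       # 120 12570
```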


**Table 9.3** Comparison of the out-of-sample mean squared losses for the calendar years $2004 \le t \le 2018$; the figures are in $10^{-4}$

We observe that in our case the CN network architecture provides good results for the female populations, whereas for the male populations we rather prefer the LSTM architecture. At the current stage we rather see this as a proof of concept, because we have not really fine-tuned the network architectures, nor has the SGD fitting been perfected; e.g., often bigger architectures are used in combination with drop-outs, etc. We refrain from doing so here, but refer to the relevant literature Perla et al. [301], Schnürch–Korn [330] and Wang et al. [375] for more sophisticated fine-tuning.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 10 Natural Language Processing**

Natural language processing (NLP) is a vastly growing field that studies language, communication and text recognition. The purpose of this chapter is to present an introduction to NLP. Important milestones in the field of NLP are the work of Bengio et al. [28, 29] who have introduced the idea of word embedding, the work of Mikolov et al. [275, 276] who have developed word2vec which is an efficient word embedding tool, and the work of Pennington et al. [300] and Chaubard et al. [68] who provide the pre-trained word embedding model GloVe<sup>1</sup> and detailed educational material.<sup>2</sup> An excellent overview of the NLP working pipeline is provided by the tutorial of Ferrario–Nägelin [126]. This overview distinguishes three approaches: (1) the classical approach using bag-of-words and bag-of-part-of-speech models to classify text documents; (2) the modern approach using word embeddings to receive a low-dimensional representation of the dictionary, which is then further processed; (3) the contemporary approach using a minimal amount of text pre-processing and directly feeding the raw data to a machine learning algorithm. We discuss these different approaches and show how they can be used to extract the relevant information from claim descriptions to predict the claim types and the claim sizes; in the actuarial literature, first papers on this topic have been published by Lee et al. [236] and Manski et al. [264].

## **10.1 Feature Pre-processing and Bag-of-Words**

NLP requires extensive feature pre-processing and engineering, as different texts can be rather diverse in language, grammar, abbreviations, typos, etc. The current developments aim at automating this process; nevertheless, many of these steps

<sup>1</sup> https://nlp.stanford.edu/projects/glove/.

<sup>2</sup> https://nlp.stanford.edu/teaching/.

M. V. Wüthrich, M. Merz, *Statistical Foundations of Actuarial Learning and its Applications*, Springer Actuarial, https://doi.org/10.1007/978-3-031-12409-9\_10

are still (tedious) manual work. Our goal here is to present the whole working pipeline to process language, perform text recognition and text understanding. As an example we use the claim data described in Sect. 13.3; this data has been made available through the book project of Frees [135], and it comprises property claims of governmental institutions in Wisconsin, US. An excerpt of the data is given in Listing 10.1; our attention focuses on line 11, which provides a (very) short claim description for every claim.

**Listing 10.1** Excerpt of the Wisconsin Local Government Property Insurance Fund (LGPIF) data set with short claim descriptions on line 11


In a first step we need to pre-process the texts to make them suitable for predictive modeling. This first step is called *tokenization*. Essentially, tokenization labels the words with integers, that is, the used vocabulary is encoded by integers. There are several issues that one has to deal with in this first step, such as upper and lower case, punctuation, orthographic errors and differences, abbreviations, etc. Different treatments of these issues will lead to different results; for more on this topic we refer to Sect. 1 in Ferrario–Nägelin [126]. We simply use the standard routine text\_tokenizer() offered in R keras [77] with its standard settings.

**Listing 10.2** Tokenization within R keras [77]

```
1 library(keras)
2
3 ## initialize tokenizer and fit
4 tokenizer <- text_tokenizer() %>% fit_text_tokenizer(dat$Description)
5
6 ## number of tokens/words
7 length(tokenizer$word_index)
8
9 ## frequency of word appearances in each text
10 freq.text <- texts_to_matrix(tokenizer, dat$Description, mode = "count")
```
The R code in Listing 10.2 shows the crucial steps in tokenization. Line 4 extracts the relevant vocabulary from all available claim descriptions. In total the 5'424 claim descriptions of Listing 10.1 use $W = 2'237$ different words. This double counts different spellings, e.g., 'color' vs. 'colour'.

Figure 10.1 shows the most frequently used words in the claim descriptions of Listing 10.1. These are (in this order): 'at', 'damage', 'damaged', 'vandalism', 'lightning', 'to', 'water', 'glass', 'park', 'fire', 'hs', 'wind', 'light', 'door', 'es', 'and', 'of', 'vehicle', 'pole' and 'power'. We observe that many of these words are directly related to insurance claims, such as 'damage' and 'vandalism', others are frequent *stopwords* like 'at' and 'to', and then there are abbreviations like 'hs' and 'es' standing for high school and elementary school.

**Listing 10.3** Word and text encoding

```
1 maxlen <- max(rowSums(freq.text))
2
3 ## encode the sentences
4 text.seq <- texts_to_sequences(tokenizer, dat$Description)
5
6 ## pad the sentences
7 text.seq.pad <- pad_sequences(text.seq, maxlen = maxlen, padding = "post")
8
9 ## examples
10 lightning/hail damage to equip at airport
11 5 48 2 6 196 1 40 0 0 0 0
12 ##
13 garage door damaged
14 36 14 3 0 0 0 0 0 0 0 0
```
The next step is to assign the (integer) labels $1 \le w \le W$ from the tokenization to the words in the texts. The maximal length over all texts/sentences is $T = 11$ words. This step, and padding the sentences with zeros to equal length $T$, is presented on lines 1–7 of Listing 10.3. Lines 11 and 14 of this listing give two explicit text examples

$$\texttt{text} = (w_1, \dots, w_T)^{\top} \in \mathcal{W}_0^{T},$$

where the used vocabulary $\mathcal{W}$ and its padded extension $\mathcal{W}_0$ are given by

$$\mathcal{W} = \{1, \dots, W\} \subset \mathbb{N} \qquad \text{and} \qquad \mathcal{W}\_0 = \mathcal{W} \cup \{0\}.$$

The label 0 is used for padding shorter texts to the common length $T = 11$. The method of *bag-of-words* embeds $\texttt{text} = (w_1, \dots, w_T)^\top$ into $\mathbb{N}_0^W$

$$\psi: \mathcal{W}_0^T \to \mathbb{N}_0^W, \qquad \texttt{text} \mapsto \psi(\texttt{text}) = \left(\sum_{t=1}^T \mathbb{1}_{\{w_t = w\}}\right)_{w \in \mathcal{W}}^{\top}.\tag{10.1}$$

The bag-of-words $\psi(\texttt{text})$ counts how often each word $w \in \mathcal{W}$ appears in a given $\texttt{text} = (w_1, \dots, w_T)^\top$; the corresponding code is given on line 10 of Listing 10.2. The bag-of-words mapping $\psi$ is not injective as the order of occurrence of the words gets lost, and, thus, the semantics of the sentence also gets lost. E.g., the following two sentences have the same bag-of-words: 'The claim is expensive.' and 'Is the claim expensive?'. This is the reason for calling it a "bag of words" (which is unordered). This bag-of-words encoding resembles one-hot encoding; namely, if every text consists of a single word, $T = 1$, then we receive the one-hot encoding with $W$ describing the number of different levels, see (7.28). The bag-of-words $\psi(\texttt{text}) \in \mathbb{N}_0^W$ can directly be used as an input to a regression model. The disadvantage of this approach is that the input typically is high-dimensional (and likely sparse), and it is recommended that only the frequent words are considered.
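The counting map (10.1) is straightforward to implement. The following Python sketch is ours (the book uses texts_to_matrix() in R); the token labels of the first example are taken from Listing 10.3, and $W$ is chosen small for illustration.

```python
def bag_of_words(text, W):
    """Bag-of-words (10.1): count how often each word w = 1..W occurs;
    the label 0 is padding and is ignored."""
    counts = [0] * W
    for w in text:
        if w != 0:
            counts[w - 1] += 1
    return counts

# padded example 'lightning/hail damage to equip at airport' from Listing 10.3
text = [5, 48, 2, 6, 196, 1, 40, 0, 0, 0, 0]
bow = bag_of_words(text, W=200)
print(sum(bow))  # 7 non-padding words in total

# non-injectivity: word order is lost under the bag-of-words mapping
print(bag_of_words([1, 2], W=2) == bag_of_words([2, 1], W=2))  # True
```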

**Listing 10.4** Removal of stopwords and lemmatization

```
1 library(textstem)
2 library(tm)
3
4 text.clean <- removeWords(dat$Description, stopwords("english"))
5 text.clean <- lemmatize_strings(text.clean, dictionary = lexicon::hash_lemmas)
```
Additionally, stopwords can be removed. We perform this removal because frequent stopwords like 'and' or 'to' may not essentially contribute to the understanding of the (short) claim descriptions; the code for the stopword removal is provided on line 4 of Listing 10.4. Moreover, stemming can be performed, which means that inflectional forms are reduced to their stem by just truncating pre- and suffixes, conjugations, declensions, etc. Lemmatization is a more sophisticated form of reducing inflectional forms by using vocabularies and morphological analyses; an example is provided on line 5 of Listing 10.4. If we perform these two steps of removing stopwords and lemmatization in our example, the number of different words is reduced from 2'237 to 1'982.
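The two cleaning steps can be sketched generically. This Python snippet is ours; the stopword set and the lemma dictionary are toy stand-ins for stopwords("english") and lexicon::hash_lemmas used in Listing 10.4.

```python
# toy stand-ins (assumptions, not the actual dictionaries of Listing 10.4)
STOPWORDS = {"at", "to", "and", "of", "the"}
LEMMAS = {"damaged": "damage", "turning": "turn", "vehicles": "vehicle"}

def clean(description):
    """Stopword removal followed by dictionary-based lemmatization."""
    words = description.lower().split()
    words = [w for w in words if w not in STOPWORDS]   # drop stopwords
    return [LEMMAS.get(w, w) for w in words]           # map inflected forms to lemmas

print(clean("garage door damaged at the park"))  # ['garage', 'door', 'damage', 'park']
```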

Another step that can be performed is tagging words with part-of-speech (POS) attributes. These POS attributes indicate whether the corresponding words are used as nouns, adjectives, adverbs, etc., in the corresponding sentences. We then call the resulting encoding bag-of-POS. We refrain from doing this because we will present more sophisticated methods in the next sections.

## **10.2 Word Embeddings**

The bag-of-words (10.1) can be interpreted as representing each word $w \in \mathcal{W} = \{1, \dots, W\}$ by a one-hot encoding in $\{0, 1\}^W$, and then aggregating these one-hot encodings over all words that appear in the given $\texttt{text} = (w_1, \dots, w_T)^\top$. Bengio et al. [28, 29] have introduced the technique of *word embedding* that maps words to a lower dimensional Euclidean space $\mathbb{R}^b$, $b \ll W$, such that proximity in $\mathbb{R}^b$ is associated with similarity in the meaning of the words, e.g., 'rain', 'water' and 'flood' should be closer to each other in $\mathbb{R}^b$ than to 'vandalism' (in an insurance context). This is exactly the idea promoted in the embedding mapping (7.31) using the embedding layers. Thus, we are looking for an embedding mapping

$$\boldsymbol{e}: \mathcal{W} \to \mathbb{R}^b, \qquad w \mapsto \boldsymbol{e}(w), \tag{10.2}$$

that maps each word $w$ (or rather its tokenization) to a $b$-dimensional vector $\boldsymbol{e}(w)$, for a given embedding dimension $b \ll W$. The general idea now is that similarity in the meaning of words can be learned from the context in which the words are used. That is, when we consider a text

$$\texttt{text} = (w_1, \dots, w_{t-1}, w_t, w_{t+1}, \dots, w_T)^\top,$$

then it might be possible to infer $w_t$ from its neighbors $w_{t-j}$ and $w_{t+j}$, $j \ge 1$. These neighbors constitute the context of the word $w_t$, and using suitable learning tools it should also be possible to learn synonyms for $w_t$, as these synonyms stand in similar contexts.

More mathematically speaking, we assume that there exists a probability distribution $p$ over the set of all texts of length $T$ (using padding with zeros to this common length)

$$\mathcal{T} = \left\{ \texttt{text} = (w_1, \dots, w_T)^{\top} \right\} \subseteq \mathcal{W}_0^{T},$$

such that a randomly chosen $\texttt{text} \in \mathcal{T}$ appears with probability $p(w_1, \dots, w_T) \in [0, 1)$. Inference of a word $w_t$ from its context can then be obtained by studying the conditional probability of $w_t$, given its context, that is,

$$p\left(w_t \mid w_1, \dots, w_{t-1}, w_{t+1}, \dots, w_T\right) = \frac{p(w_1, \dots, w_T)}{p(w_1, \dots, w_{t-1}, w_{t+1}, \dots, w_T)}.\tag{10.3}$$

Since, typically, the probability distribution $p$ is not known, we aim at learning it from the available data. This idea has been taken up by Mikolov et al. [275, 276] who designed the word to vector (word2vec) algorithm. Pennington et al. [300] designed an alternative algorithm called global vectors (GloVe); we also refer to Chaubard et al. [68]. We describe these algorithms in the following sections.

## *10.2.1 Word to Vector Algorithms*

There are two ways of estimating the probability $p$ in (10.3). Either we can try to predict the *center word* $w_t$ from its context as in (10.3), or we can try to predict the context from the center word $w_t$, which applies Bayes' rule to (10.3). The latter variant is called *skip-gram* and the former variant is called *continuous bag-of-words* (CBOW), if we neglect the order of the words in the context. These two approaches have been developed by Mikolov et al. [275, 276].

#### **Skip-gram Approach**

Typically, inferring a general probability distribution $p$ over $\mathcal{T}$ is too complex. Therefore, we make a simplifying assumption. This simplifying assumption is not reasonable from a practical linguistic point of view, but it is sufficient to receive a reasonable word embedding map $\boldsymbol{e} : \mathcal{W} \to \mathbb{R}^b$. We assume that the context words are conditionally i.i.d., given the center word $w_t$. Choosing a fixed context (window) size $c \in \mathbb{N}$, we try to maximize the log-likelihood over all probabilities $p$ satisfying this conditional i.i.d. assumption

$$\begin{aligned} \ell_{\boldsymbol{W}} &= \sum_{i=1}^{n} \log p\left(w_{i,t-c}, \dots, w_{i,t-1}, w_{i,t+1}, \dots, w_{i,t+c} \,\middle|\, w_{i,t} \right) \\ &= \sum_{i=1}^{n} \sum_{-c \le j \le c,\, j \ne 0} \log p\left(w_{i,t+j} \,\middle|\, w_{i,t} \right),\end{aligned}\tag{10.4}$$

having $n$ independent rows in the observed data matrix $\boldsymbol{W} = (w_{i,t-c}, \dots, w_{i,t+c})_{1 \le i \le n} \in \mathcal{W}^{n \times (2c+1)}$. Thus, under the conditional i.i.d. assumption on the context words, given the center word, the probabilities (10.4) infer the occurrence of (individual) context words of a given center word $w_{i,t}$ within a symmetric window of fixed size $c$. In the sequel we directly work with the log-likelihood (10.4), provided that a context word $w_{i,t+j}$ exists for index $j$; otherwise the corresponding term is just dropped from the sum in (10.4).

The remaining step is to estimate the conditional probabilities $p(w_{t+j} \mid w_t)$ from the data matrix $\boldsymbol{W}$. This step will provide us with the embeddings (10.2). This estimation is obtained by considering an approach similar to a GLM for categorical responses, see Sect. 5.7. We make the following ansatz for the context word $w_s$ and the center word $w_t$ (for all $j$)

$$p\left(w_s \mid w_t\right) = \frac{\exp\left\langle\widetilde{\boldsymbol{e}}(w_s), \boldsymbol{e}(w_t)\right\rangle}{\sum_{w=1}^{W} \exp\left\langle\widetilde{\boldsymbol{e}}(w), \boldsymbol{e}(w_t)\right\rangle} \in (0, 1),\tag{10.5}$$

where $\boldsymbol{e}$ and $\widetilde{\boldsymbol{e}}$ are two (different) embedding maps (10.2) that have the same embedding dimension $b \in \mathbb{N}$. Thus, we construct two different embeddings $\boldsymbol{e}$ and $\widetilde{\boldsymbol{e}}$ for the center words and for the context words, respectively, and these embeddings (embedding weights) are chosen such that the log-likelihood (10.4) is maximized for the given observations $\boldsymbol{W}$. These assumptions give us a minimization problem for the negative log-likelihood in the embedding mappings, i.e., we minimize over the embeddings $\boldsymbol{e}$ and $\widetilde{\boldsymbol{e}}$

$$\begin{aligned} -\ell_{\boldsymbol{W}} &= -\sum_{i=1}^{n} \sum_{-c \le j \le c,\, j \ne 0} \log \left( \frac{\exp\left\langle\widetilde{\boldsymbol{e}}(w_{i,t+j}), \boldsymbol{e}(w_{i,t})\right\rangle}{\sum_{w=1}^{W} \exp\left\langle\widetilde{\boldsymbol{e}}(w), \boldsymbol{e}(w_{i,t})\right\rangle} \right) \tag{10.6} \\ &= -\sum_{i=1}^{n} \left( \sum_{-c \le j \le c,\, j \ne 0} \left\langle\widetilde{\boldsymbol{e}}(w_{i,t+j}), \boldsymbol{e}(w_{i,t})\right\rangle - 2c \log \left( \sum_{w=1}^{W} \exp\left\langle\widetilde{\boldsymbol{e}}(w), \boldsymbol{e}(w_{i,t})\right\rangle \right) \right). \end{aligned}$$

These optimal embeddings are learned using a variant of the gradient descent algorithm. This often results in a very high-dimensional optimization problem as we have 2*bW* parameters to learn, and the calculation of the last (normalization) term in (10.6) can be very expensive in gradient descent algorithms. For this reason we present the method of negative sampling below.
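To make the normalization cost in (10.6) concrete, the following small NumPy sketch (our own toy illustration with hypothetical random embeddings, not code from the chapter) evaluates the skip-gram negative log-likelihood contribution of one center word: the log-normalization requires a sum over the full vocabulary in every evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = 10, 2                      # toy vocabulary size and embedding dimension
E = rng.normal(size=(W, b))       # center-word embeddings e(w)
Et = rng.normal(size=(W, b))      # context-word embeddings e~(w)

def neg_log_lik(center, contexts):
    """Contribution of one center word to the negative log-likelihood (10.6)."""
    scores = Et @ E[center]                  # <e~(w), e(center)> for all w
    log_norm = np.log(np.exp(scores).sum())  # O(W) normalization in every step
    return sum(log_norm - scores[w] for w in contexts)
```

Each term is the negative log of a softmax probability, so the gradient of the normalization term touches all context embeddings; this is the cost that negative sampling avoids.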

#### **Continuous Bag-of-Words**

For the CBOW method we start from the log-likelihood for a context size $c \in \mathbb{N}$, given the observations $W$,

$$\sum_{i=1}^{n} \log p\left(w_{i,t} \mid w_{i,t-c}, \dots, w_{i,t-1}, w_{i,t+1}, \dots, w_{i,t+c}\right).$$

Again we need to reduce the complexity, which requires an approximation to the above. Assume that the embedding map of the context words is given by $\widetilde{\boldsymbol{e}}: \mathcal{W} \to \mathbb{R}^b$. We then average over the embeddings of the context words in order to predict the center word. Define the average embedding of the context words of $w_{i,t}$ (with a fixed window size $c$) by

$$\widetilde{\boldsymbol{e}}_{i,t} = \frac{1}{2c} \sum_{-c \le j \le c,\, j \ne 0} \widetilde{\boldsymbol{e}}(w_{i,t+j}).$$

Making an ansatz similar to (10.5), the full log-likelihood is approximated by

$$\sum_{i=1}^{n} \log p\left(w_{i,t} \mid \widetilde{\boldsymbol{e}}_{i,t}\right) = \sum_{i=1}^{n} \log \left( \frac{\exp \left\langle \widetilde{\boldsymbol{e}}_{i,t}, \boldsymbol{e}(w_{i,t}) \right\rangle}{\sum_{w=1}^{W} \exp \left\langle \widetilde{\boldsymbol{e}}_{i,t}, \boldsymbol{e}(w) \right\rangle} \right) \tag{10.7}$$

$$= \sum_{i=1}^{n} \left( \left\langle \widetilde{\boldsymbol{e}}_{i,t}, \boldsymbol{e}(w_{i,t}) \right\rangle - \log \left( \sum_{w=1}^{W} \exp \left\langle \widetilde{\boldsymbol{e}}_{i,t}, \boldsymbol{e}(w) \right\rangle \right) \right).$$

Again the gradient descent method is applied to the negative log-likelihood to learn the optimal embedding maps $\boldsymbol{e}$ and $\widetilde{\boldsymbol{e}}$.
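As a minimal sketch of (10.7) (our own toy NumPy illustration with hypothetical random embeddings), the CBOW model scores each candidate center word against the averaged context embedding:

```python
import numpy as np

rng = np.random.default_rng(1)
W, b = 10, 2                      # toy vocabulary size and embedding dimension
E = rng.normal(size=(W, b))       # center-word embeddings e(w)
Et = rng.normal(size=(W, b))      # context-word embeddings e~(w)

def cbow_log_prob(center, contexts):
    """log p(center | context), cf. (10.7), using the averaged context embedding."""
    e_bar = Et[contexts].mean(axis=0)   # average context embedding of the window
    scores = E @ e_bar                  # <e_bar, e(w)> for all candidate centers w
    return scores[center] - np.log(np.exp(scores).sum())
```

The resulting conditional probabilities over the $W$ candidate center words sum to one, as they come from a softmax.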

*Remark 10.1* In both cases, skip-gram and CBOW, we estimate two separate embeddings $\boldsymbol{e}$ and $\widetilde{\boldsymbol{e}}$ for the center word and the context words. Typically, CBOW is faster to train, but skip-gram performs better on less frequent words.

#### **Negative Sampling**

There is a computational issue in (10.6) and (10.7) because the probability normalizations in (10.6) and (10.7) aggregate over all available words $w \in \mathcal{W}$. This can be computationally demanding because we need to perform this calculation in each gradient descent step. For this reason, Mikolov et al. [276] turn the log-likelihood optimization problem (10.6) into a binary classification problem. Consider a pair $(w, \widetilde{w}) \in \mathcal{W} \times \mathcal{W}$ of center word $w$ and context word $\widetilde{w}$. We introduce a binary response variable $Y \in \{1, 0\}$ that indicates whether an observation $(W, \widetilde{W}) = (w, \widetilde{w})$ is coming from a true center-context pair (from our texts) or whether we have a fake center-context pair (that has been generated randomly). Choosing the canonical link of the Bernoulli EF (logistic/sigmoid function) we make the following ansatz (in the skip-gram approach) to test for the authenticity of a center-context pair $(w, \widetilde{w})$

$$\mathbb{P}\left[Y=1\mid w,\widetilde{w}\right] = \frac{1}{1 + \exp\left\{-\langle \widetilde{\boldsymbol{e}}(\widetilde{w}), \boldsymbol{e}(w)\rangle\right\}}.\tag{10.8}$$

The recipe now is as follows: (1) Consider for a given window size $c$ all center-context pairs $(w_i, \widetilde{w}_i) \in \mathcal{W} \times \mathcal{W}$ of our texts, and equip them with a response $Y_i = 1$. Assume we have $N$ such observations. (2) Simulate $N$ i.i.d. pairs $(W_{N+k}, \widetilde{W}_{N+k})$, $1 \le k \le N$, by randomly choosing $W_{N+k}$ and $\widetilde{W}_{N+k}$, independent from each other (by performing independent re-sampling with or without replacement from the data $(w_i)_{1\le i\le N}$ and $(\widetilde{w}_i)_{1\le i\le N}$, respectively). Equip these (false) pairs with the response $Y_{N+k} = 0$. (3) Maximize the following log-likelihood as a function of the embedding maps $\boldsymbol{e}$ and $\widetilde{\boldsymbol{e}}$

$$\ell_Y = \sum_{i=1}^{2N} \log \mathbb{P} \left[ Y = Y_i \mid w_i, \widetilde{w}_i \right] \tag{10.9}$$

$$= \sum_{i=1}^{N} \log \left( \frac{1}{1 + \exp\left\{-\langle\widetilde{\boldsymbol{e}}(\widetilde{w}_i), \boldsymbol{e}(w_i)\rangle\right\}} \right) + \sum_{k=N+1}^{2N} \log \left( \frac{1}{1 + \exp\left\{\langle\widetilde{\boldsymbol{e}}(\widetilde{w}_k), \boldsymbol{e}(w_k)\rangle\right\}} \right).$$

This approach is called *negative sampling* because we sample false or negative pairs $(W_{N+k}, \widetilde{W}_{N+k})$ that should not appear in our texts (as $W_{N+k}$ and $\widetilde{W}_{N+k}$ have been generated independently from each other). The binary classification (10.9) aims at detecting the negative pairs by letting the scalar products $\langle\widetilde{\boldsymbol{e}}(\widetilde{w}_i), \boldsymbol{e}(w_i)\rangle$ be large for the true pairs and letting the scalar products $\langle\widetilde{\boldsymbol{e}}(\widetilde{w}_k), \boldsymbol{e}(w_k)\rangle$ be small for the false pairs. The former means that $\widetilde{\boldsymbol{e}}(\widetilde{w}_i)$ and $\boldsymbol{e}(w_i)$ should point into the same direction in the embedding space $\mathbb{R}^b$. The same should apply for a synonym of $w_i$ and, thus, we receive the desired behavior that synonyms or words with similar meanings tend to cluster.
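The three steps of the recipe can be sketched in a few lines of NumPy (our own toy illustration with hypothetical random embeddings; the chapter's R/keras implementation is given in Listing 10.6): step (1) builds the true center-context pairs, step (2) creates negatives by permuting the context components, and step (3) evaluates the binary log-likelihood (10.9).

```python
import numpy as np

rng = np.random.default_rng(2)
W, b, c = 10, 2, 2
E = rng.normal(size=(W, b))       # center-word embeddings e(w)
Et = rng.normal(size=(W, b))      # context-word embeddings e~(w)

def skipgram_pairs(seqs, c):
    """(1) All center-context pairs (w, w~) within a window of size c."""
    return [(seq[t], seq[t + j]) for seq in seqs for t in range(len(seq))
            for j in range(-c, c + 1) if j != 0 and 0 <= t + j < len(seq)]

texts = [[5, 1, 3], [2, 4, 2, 7]]             # toy tokenized texts
true_pairs = skipgram_pairs(texts, c)

# (2) negative pairs: permute the context components of the true pairs
tau = rng.permutation(len(true_pairs))
false_pairs = [(true_pairs[i][0], true_pairs[tau[i]][1])
               for i in range(len(true_pairs))]

def log_lik(pairs, labels):
    """(3) Binary log-likelihood (10.9) with sigmoid probabilities (10.8)."""
    ll = 0.0
    for (w, wt), y in zip(pairs, labels):
        p = 1.0 / (1.0 + np.exp(-Et[wt] @ E[w]))   # P[Y=1 | w, w~]
        ll += np.log(p) if y == 1 else np.log(1.0 - p)
    return ll

ll = log_lik(true_pairs + false_pairs,
             [1] * len(true_pairs) + [0] * len(false_pairs))
```

Note that no normalization over the vocabulary appears anywhere; each gradient step only touches the embeddings of the sampled pairs.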

*Example 10.2 (word2vec with Negative Sampling)* We provide an example by constructing a word2vec embedding based on negative sampling. For this we aim at maximizing the log-likelihood (10.9) by finding optimal embedding maps $\boldsymbol{e}$ and $\widetilde{\boldsymbol{e}}: \mathcal{W} \to \mathbb{R}^b$. To construct these embedding maps we use the Wisconsin LGPIF data described in Sect. 13.3. The first decision (hyper-parameter) is the choice of the embedding dimension $b$. The English language has millions of different words, and these words should be (in some sense) densely embedded into a $b$-dimensional Euclidean space. Typical choices of $b$ vary between 50 and 300. Our LGPIF data vocabulary is much smaller, and for this example we choose $b = 2$ because this allows us to nicely illustrate the learned embeddings. However, apart from illustration, we should not choose such a small dimension, as it does not provide sufficient flexibility to discriminate the words, as we will see.

We consider all available claim texts described in Sect. 13.3. These are 6'031 texts coming from the training and validation data sets (we include the validation data here to have more texts for learning the embeddings; this is different from Sect. 10.1). We extract the claim descriptions from these two data sets and we apply some pre-processing to the texts. This involves transforming all letters to lower case, removing the special characters like !"/&, and removing the stopwords. Moreover, we remove the words 'damage' and 'damaged' as these two words are very common in our insurance claim descriptions, see Fig. 10.1, but they do not further specify the claim type. Then we apply lemmatization, see Listing 10.4, and we adjust the vocabulary with the GloVe database,<sup>3</sup> see also Example 10.4. The latter step is

<sup>3</sup> https://nlp.stanford.edu/projects/glove/.

(tedious) manual work, and we do this step to be able to compare our results to pre-trained word2vec versions.

After this pre-processing we apply the tokenizer, see line 4 of Listing 10.2. This gives us 1'829 different words. To construct our (illustrative) embedding we only consider the words that appear at least 20 times over all texts; these are $W = 142$ words. Thus, the following analysis is only based on the $W = 142$ most frequent words. Of course, we could increase our vocabulary by considering any text that can be downloaded from the internet. Since we would like to perform an insurance claim analysis, these texts should be related to an insurance context so that the learned embeddings reflect an insurance experience; we come back to this in Remark 10.4, below. We refrain here from doing so and embed these $W = 142$ words into the Euclidean plane ($b = 2$).

**Listing 10.5** Tokenization of the most frequent words

```
1 ## applying the tokenizer to the cleaned texts
2 tokenizer <- text_tokenizer(num_words=142+1) %>% fit_text_tokenizer(dat$clean)
3 #
4 seqs <- texts_to_sequences(tokenizer, dat$clean)
5 #
6 ## skip-gram of text 1 using a window of size 2
7 skipgrams(sequence=unlist(seqs[[1]]),
8           vocabulary_size=142, window_size=2, negative_samples=0)
```
Listing 10.5 shows the tokenization of the most frequent words, and on line 4 we build the (shortened) texts $w_1, w_2, \dots$, only considering these most frequent words $w \in \mathcal{W} = \{1,\dots,W\}$. In total we receive 4'746 texts that contain at least two words from $\mathcal{W}$ and, hence, can be used for the skip-gram building of center-context pairs $(w, \widetilde{w}) \in \mathcal{W} \times \mathcal{W}$. Lines 7–8 give the code for building these pairs for a window of size $c = 2$. In total we receive $N = 23'952$ center-context pairs $(w_i, \widetilde{w}_i)$ from our texts. We equip these pairs with a response $Y_i = 1$. For the false pairs, we randomly permute the second component of the true pairs, $(W_{N+i}, \widetilde{W}_{N+i}) = (w_i, \widetilde{w}_{\tau(i)})$, where $\tau$ is a random permutation of $\{1,\dots,N\}$. These false pairs are equipped with a response $Y_{N+i} = 0$. Thus, altogether we have $2N = 47'904$ observations $(Y_i, w_i, \widetilde{w}_i)$, $1 \le i \le 2N$, that can be used to learn the embeddings $\boldsymbol{e}$ and $\widetilde{\boldsymbol{e}}$. Listing 10.6 shows the R code to perform the embedding learning using the negative

sampling (10.9).

**Listing 10.6** R code for negative sampling

```
1 center = layer_input(shape = c(1), dtype = 'int32')
2 context = layer_input(shape = c(1), dtype = 'int32')
3 #
4 centerEmb = center %>%
5 layer_embedding(input_dim=142,output_dim=2,input_length=1) %>% layer_flatten()
6 contextEmb = context %>%
7 layer_embedding(input_dim=142,output_dim=2,input_length=1) %>% layer_flatten()
8 #
9 response = list(centerEmb, contextEmb) %>% layer_dot(axes = 1) %>%
10 layer_dense(units=1, activation='sigmoid', name='response')
11 #
12 model = keras_model(inputs = c(center, context), outputs = c(response))
```

This network has $2bW = 568$ embedding weights that need to be learned from the data. There are two more parameters involved on line 10 of Listing 10.6: these two parameters shift the scalar products by an intercept $\beta_0$ and scale them by a constant $\beta_1$. We could set $(\beta_0, \beta_1) = (0, 1)$; however, keeping these two parameters trainable has led to results that are better centered around the origin. Of course, these two parameters do not harm the arguments, as they only replace (10.8) by a slightly different model

$$\mathbb{P}\left[Y=1\mid w, \widetilde{w}\right] = \frac{1}{1 + \exp\left\{-\beta_0 - \beta_1\langle\widetilde{\boldsymbol{e}}(\widetilde{w}), \boldsymbol{e}(w)\rangle\right\}} = \frac{e^{\beta_0}}{e^{\beta_0} + e^{-\beta_1\langle\widetilde{\boldsymbol{e}}(\widetilde{w}), \boldsymbol{e}(w)\rangle}},$$

and

$$\mathbb{P}\left[Y=0\mid w,\widetilde{w}\right] = 1 - \frac{e^{\beta_0}}{e^{\beta_0} + e^{-\beta_1\langle\widetilde{\boldsymbol{e}}(\widetilde{w}), \boldsymbol{e}(w)\rangle}} = \frac{e^{-\beta_0}}{e^{-\beta_0} + e^{\beta_1\langle\widetilde{\boldsymbol{e}}(\widetilde{w}), \boldsymbol{e}(w)\rangle}}.$$

We fit this model using the nadam version of the gradient descent algorithm, and the fitted embedding weights can be extracted with get\_weights(model). Figure 10.2 shows the learned embedding weights $\boldsymbol{e}(w) \in \mathbb{R}^2$ of all words $w \in \mathcal{W}$. We highlight the words that coincide with the insured hazards in red color, see line 10 of Listing 10.1. The word 'vehicle' is in the first quadrant and it is surrounded by 'pole', 'truck', 'garage', 'car', 'traffic'. The word 'vandalism' is in the third quadrant surrounded by 'graffito', 'window', 'pavilion', names of cities and parks, and 'ms' for middle school. Finally, the words 'fire', 'wind', 'lightning' and 'hail' are in the first and fourth quadrant, close to 'water'; these words are surrounded by 'bldg' (building), 'smoke', 'equipment', 'alarm', 'safety', 'power', 'library', etc. We conclude that these embeddings make perfect sense in an insurance claim context. Note that we have applied some pre-processing, and the embeddings could be improved even further by additional pre-processing, e.g., both 'vandalism' and 'vandalize' as well as 'hs' and 'high school' are still used.

Another nice observation is that the embeddings tend to build a circle around the origin, see Fig. 10.2. This is enforced by embedding $W = 142$ different words into a $b = 2$ dimensional space, so that dissimilar words optimally repulse each other.


**Fig. 10.2** Two-dimensional skip-gram embedding using negative sampling; in red color are the insured hazards 'vehicle', 'fire', 'lightning', 'wind', 'hail', 'water' and 'vandalism'

## *10.2.2 Global Vectors Algorithm*

A second popular word embedding approach is global vectors (GloVe), developed by Pennington et al. [300]; we also refer to Chaubard et al. [68]. GloVe is an unsupervised learning method that performs a word-word clustering (of center-context pairs) over all available texts. Assume that the tokenization of all texts provides us with the words $w \in \mathcal{W}$. Choose a fixed context window size $c \in \mathbb{N}$ and define the matrix

$$\mathcal{C} = \left(\mathcal{C}(w,\widetilde{w})\right)\_{w,\widetilde{w}\in\mathcal{W}} \in \mathbb{N}\_0^{W\times W},$$

with $C(w, \widetilde{w})$ counting the number of co-occurrences of $w$ and $\widetilde{w}$ over all available texts where the word $\widetilde{w}$ appears as a context word of the center word $w$ (for the given window size $c$). We note that $\mathcal{C}$ is a symmetric matrix that is typically sparse, as many words do not appear in the context of other words (on finitely many texts). Figure 10.3 shows the center-context co-occurrence matrix $\mathcal{C}$ of Example 10.2, which is based on $W = 142$ words and 23'952 center-context pairs. The color pixels indicate the pairs that occur in the data, $C(w, \widetilde{w}) > 0$, and the white space corresponds to the pairs that have not been observed in the texts, $C(w, \widetilde{w}) = 0$. This plot confirms the sparsity of the center-context pairs; the words are ordered w.r.t. their frequencies in the texts.
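The collection of such a co-occurrence matrix can be sketched as follows (our own NumPy illustration on toy token sequences):

```python
import numpy as np

def cooccurrence(seqs, W, c=2):
    """C[w, w~] counts how often w~ lies within the window of size c around w."""
    C = np.zeros((W, W), dtype=np.int64)
    for seq in seqs:
        for t, w in enumerate(seq):
            for j in range(-c, c + 1):
                if j != 0 and 0 <= t + j < len(seq):
                    C[w, seq[t + j]] += 1
    return C

C = cooccurrence([[0, 1, 2], [1, 2, 1]], W=3, c=2)
```

Since every occurrence of $\widetilde{w}$ in the window of $w$ implies that $w$ lies in the window of $\widetilde{w}$, the matrix is symmetric; for a realistic vocabulary it should be stored in a sparse format.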

In an empirical analysis, Pennington et al. [300] have observed that the crucial quantities to be considered are ratios for fixed context words. That is, for a context word $\widetilde{w}$ study a function of the center words $w$ and $v$ (subject to existence of the right-hand side)

$$(w, v, \widetilde{w}) \mapsto F(w, v, \widetilde{w}) = \frac{C(w, \widetilde{w}) \big/ \sum_{\widetilde{u} \in \mathcal{W}} C(w, \widetilde{u})}{C(v, \widetilde{w}) \big/ \sum_{\widetilde{u} \in \mathcal{W}} C(v, \widetilde{u})} = \frac{\widehat{p}(\widetilde{w} \mid w)}{\widehat{p}(\widetilde{w} \mid v)},$$

with $\widehat{p}$ denoting the empirical probabilities. An empirical analysis suggests that such an approach leads to a good discrimination of the meanings of the words, see Sect. 3 in Pennington et al. [300]. Further simplifications and assumptions provide the following ansatz, for details we refer to Pennington et al. [300],

$$\log C(w,\widetilde{w}) \approx \langle \widetilde{\boldsymbol{e}}(\widetilde{w}), \boldsymbol{e}(w) \rangle + \widetilde{\beta}_{\widetilde{w}} + \beta_w,$$

with intercepts $\widetilde{\beta}_{\widetilde{w}}, \beta_w \in \mathbb{R}$. There is still one issue, namely, that $\log C(w, \widetilde{w})$ may not be well-defined as certain pairs $(w, \widetilde{w})$ are not observed. Therefore, Pennington et al. [300] propose to solve a weighted squared error loss function problem to find the embedding mappings $\boldsymbol{e}, \widetilde{\boldsymbol{e}}$ and intercepts $\widetilde{\beta}_{\widetilde{w}}, \beta_w \in \mathbb{R}$. Their objective function is given by

$$\sum_{w,\widetilde{w}\in\mathcal{W}}\chi(C(w,\widetilde{w}))\left(\log C(w,\widetilde{w})-\langle\widetilde{\boldsymbol{e}}(\widetilde{w}),\boldsymbol{e}(w)\rangle-\widetilde{\beta}_{\widetilde{w}}-\beta_{w}\right)^{2},\tag{10.10}$$

with weighting function

$$\chi(x) = \left(\frac{x \wedge x_{\max}}{x_{\max}}\right)^{\alpha}, \qquad x \ge 0,$$

for $x_{\max} > 0$ and $\alpha > 0$. Pennington et al. [300] state that the model depends only weakly on the cutoff point $x_{\max}$; they propose $x_{\max} = 100$, and a sub-linear behavior seems to outperform a linear one, suggesting, e.g., a choice of $\alpha = 3/4$. Under these choices the embeddings $\boldsymbol{e}$ and $\widetilde{\boldsymbol{e}}$ are found by minimizing the objective function (10.10) for the given data. Note that $\lim_{x \downarrow 0} \chi(x)(\log x)^2 = 0$.
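The objective (10.10) with the weighting function $\chi$ can be sketched as follows (our own NumPy illustration with hypothetical random counts and embeddings; a production fit would run stochastic gradient descent over the non-zero entries only):

```python
import numpy as np

def chi(x, x_max=100.0, alpha=0.75):
    """Weighting function: sub-linear up to the cutoff x_max, then constant 1."""
    return (np.minimum(x, x_max) / x_max) ** alpha

def glove_loss(C, E, Et, beta, beta_t):
    """Weighted squared error (10.10); unobserved pairs C(w, w~)=0 drop out."""
    loss = 0.0
    for w in range(C.shape[0]):
        for wt in range(C.shape[1]):
            if C[w, wt] > 0:
                resid = np.log(C[w, wt]) - Et[wt] @ E[w] - beta_t[wt] - beta[w]
                loss += chi(C[w, wt]) * resid ** 2
    return loss

rng = np.random.default_rng(4)
W, b = 5, 2
C = rng.poisson(3.0, size=(W, W))           # toy co-occurrence counts
E, Et = rng.normal(size=(W, b)), rng.normal(size=(W, b))
beta, beta_t = np.zeros(W), np.zeros(W)
```

The skip of zero-count pairs implements the vanishing-weight property $\lim_{x \downarrow 0} \chi(x)(\log x)^2 = 0$ mentioned above.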

*Example 10.3 (GloVe Word Embedding)* We provide an example using the GloVe embedding model, and we revisit the data of Example 10.2; we also use exactly the same pre-processing as in that example. We start from $N = 23'952$ center-context pairs.

In a first step we count the number of co-occurrences $C(w, \widetilde{w})$. There are only 4'972 pairs that occur, $C(w, \widetilde{w}) > 0$; this corresponds to the colored pixels in Fig. 10.3. With these 4'972 pairs we have to fit 568 embedding weights (for the embedding dimension $b = 2$) and 284 intercepts $\widetilde{\beta}_{\widetilde{w}}, \beta_w$, thus, 852 parameters in total. The results of this fitting are shown in Fig. 10.4.

The general picture in Fig. 10.4 is similar to Fig. 10.2, e.g., 'vandalism' is surrounded by 'graffito', 'window', 'pavilion', names of cities and parks, 'ms' and 'es'; or 'vehicle' is surrounded by 'pole', 'traffic', 'street', 'signal'. However, the clustering of the words around the origin shows a crucial difference between GloVe and the negative sampling of word2vec. The problem here is that we do not have sufficiently many observations. We have 4'972 center-context pairs that occur, $C(w, \widetilde{w}) > 0$. 2'396 of these pairs occur exactly once, $C(w, \widetilde{w}) = 1$; this is almost half of the observations with $C(w, \widetilde{w}) > 0$. GloVe (10.10) considers these observations on the log-scale, which gives $\log C(w, \widetilde{w}) = 0$ for the pairs that occur exactly once. The weighted square loss for these pairs is minimized by either setting $\widetilde{\boldsymbol{e}}(\widetilde{w}) = 0$ or $\boldsymbol{e}(w) = 0$, provided that the intercepts are also set to 0. This is exactly what we observe in Fig. 10.4 and, thus, successfully fitting GloVe would require many more (frequent) observations.

*Remark 10.4 (Pre-trained Word Embeddings)* In practical applications we rely on pre-trained word embeddings. For GloVe there are pre-trained versions that can be downloaded.<sup>4</sup> These pre-trained versions comprise a vocabulary of 400K words, and they exist for the embedding dimensions $b = 50, 100, 200, 300$. These GloVe embeddings have been trained on Wikipedia 2014 and Gigaword 5, which provided roughly 6B tokens. Another pre-trained open-source model that can be downloaded is spaCy.<sup>5</sup>

<sup>4</sup> https://nlp.stanford.edu/projects/glove/.

<sup>5</sup> https://spacy.io/models/en#en\_core\_web\_md.


**Fig. 10.4** Two-dimensional GloVe embedding; in red color are the insured hazards 'vehicle', 'fire', 'lightning', 'wind', 'hail', 'water' and 'vandalism'

Pre-trained embeddings can be problematic if we work in very specific settings. For instance, the Wisconsin LGPIF data contains the word 'Lincoln' in the claim descriptions. Now, Lincoln is a county in Wisconsin, it is a town in Kewaunee County in Wisconsin, it is a former US president, there are Lincoln memorials, it is a common street name, it is a car brand and there are restaurants with this name. In our context, Lincoln is most commonly used w.r.t. the Lincoln Elementary and Middle Schools. On the other hand, it is likely that in pre-trained embeddings a different meaning of Lincoln is predominant, and therefore the embedding may not be suitable for our insurance problem.

## **10.3 Lab: Predictive Modeling Using Word Embeddings**

This section gives an example of applying the word embedding technique in a predictive modeling setting. This example is based on the Wisconsin LGPIF data set illustrated in Listing 10.1. Our goal is to predict the hazard types on line 10 of Listing 10.1 from the claim descriptions on line 11. We perform the same data cleaning process as in Example 10.2. This provides us with $W = 1'829$ different words, and the resulting (short) claim descriptions have a maximal length of $T = 9$. After padding with zeros we receive $n = 6'031$ claim descriptions given by texts $(w_1,\dots,w_T)^\top \in \mathcal{W}_0^T$; we apply the padding to the left end of the sentences.

**Word2vec Using Negative Sampling** We start with the word2vec embedding technique using negative sampling. We follow Example 10.2, and to successfully embed the available words $w \in \mathcal{W}$ we restrict the vocabulary to the words that are used at least 20 times. This reduces the vocabulary from 1'829 different words to 142 different words. The number of claim descriptions is reduced to 5'883, because 148 claim descriptions do not contain any of these 142 different words and, thus, cannot be classified as one of the hazard types (based on this reduced vocabulary).

In a first analysis we choose the embedding dimension $b = 2$; this provides us with the word2vec embedding map illustrated in Fig. 10.2. Based on these embeddings we aim at predicting the hazard types from the claim descriptions. We have 9 different hazard types: Fire, Lightning, Hail, Wind, WaterW, WaterNW, Vehicle, Vandalism and Misc.<sup>6</sup> Therefore, we design a categorical classification model that has 9 different labels; we refer to Sect. 2.1.4.

**Listing 10.7** R code for the hazard type prediction based on a word2vec embedding

```
1 input = layer_input(shape = list(T), name = "input")
2 #
3 word2vec = input %>%
4 layer_embedding(input_dim = W+1, output_dim = b, input_length = T,
5 weights=list(wordEmb), trainable=FALSE) %>%
6 layer_flatten()
7 #
8 response = word2vec %>%
9 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
10 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
11 layer_dense(units=9, activation='softmax', name='output')
12 #
13 model = keras_model(inputs = c(input), outputs = c(response))
```
The R code for the hazard type prediction is presented in Listing 10.7. The crucial part is shown on line 5. Namely, the embedding map $\boldsymbol{e}(w) \in \mathbb{R}^b$, $w \in \mathcal{W}$, is initialized with the embedding weights wordEmb received from Example 10.2, and

<sup>6</sup> WaterW relates to weather related water claims, and WaterNW relates to non-weather related water claims.

**Fig. 10.5** Confusion matrices of the hazard type prediction using a word2vec embedding based on negative sampling (lhs) *b* = 2 dimensional embedding and (rhs) *b* = 10 dimensional embedding; columns show the observations and rows show the predictions

these embedding weights are declared to be non-trainable.<sup>7</sup> These features are then inputted into a FN network with two FN layers having $(q_1, q_2) = (20, 15)$ neurons, and as output activation we choose the softmax function. This model has 286 non-trainable embedding weights, and $r = (9 \cdot 2 + 1)\,20 + (20 + 1)\,15 + (15 + 1)\,9 = 839$ trainable parameters.

We fit this network using the nadam version of the gradient descent method, and we exercise early stopping on a 20% validation data set (of the entire data). This network is fitted in a few seconds, and the results are presented in Fig. 10.5 (lhs). This figure shows the confusion matrix of predictions vs. observations (rows vs. columns). The general results look rather good; there are only difficulties in distinguishing WaterW from WaterNW claims.

In a second analysis, we increase the embedding dimension to $b = 10$ and we perform exactly the same procedure as above. A higher embedding dimension allows the embedding map to better discriminate the words in their meanings. However, we should not choose too high a $b$ because we have only 142 different words and 47'904 center-context pairs $(w, \widetilde{w})$ to learn these embeddings $\boldsymbol{e}(w) \in \mathbb{R}^b$. A higher embedding dimension also increases the number of network weights in the first FN layer on line 9 of Listing 10.7. This time, we need to train $r = (9 \cdot 10 + 1)\,20 + (20 + 1)\,15 + (15 + 1)\,9 = 2'279$ parameters. The results are presented in Fig. 10.5 (rhs). We observe an overall improvement compared to the 2-dimensional embeddings. This is also confirmed by Table 10.1, which gives the deviance losses and the misclassification rates.

<sup>7</sup> The zeros from padding are mapped to the origin.


**Table 10.1** Hazard prediction results summarized in deviance losses and misclassification rates

**Pre-trained GloVe Embedding** In a next analysis we use the pre-trained GloVe embeddings, see Remark 10.4. This allows us to use all $W = 1'829$ words that appear in the $n = 6'031$ claim descriptions, and we can also classify all these claims. That is, we can classify more claims here, compared to the 5'883 claims classified based on the self-trained word2vec embeddings. Apart from that, all modeling steps are chosen as above. Only the higher embedding dimension $b = 50$ of the pre-trained glove.6B.50d increases the number of network parameters to $r = (9 \cdot 50 + 1)\,20 + (20 + 1)\,15 + (15 + 1)\,9 = 9'479$; remark that the 91'500 embedding weights are not trained, as they come from the pre-trained GloVe embeddings. Using the nadam optimizer with early stopping provides us with the results in Fig. 10.6 (lhs). Using this pre-trained GloVe embedding leads to a further improvement; this is also verified by Table 10.1. The effect of using the pre-trained GloVe is two-fold. On the one hand, it allows us to use all words of the claim descriptions, which improves the prediction accuracy. On the other hand, the embeddings are not adapted to insurance problems, as they have been trained on Wikipedia and Gigaword texts. The former advantage overrules the latter shortcoming in our example.

All the results above are based on the FN network of Listing 10.7. We made this choice because our texts have a maximal length of $T = 9$, which is very short. In general, texts should be understood as time-series, and RN networks are a canonical choice to analyze these time-series. Therefore, we study again the pre-trained GloVe embeddings, but we process the texts with a LSTM architecture; we refer to Sect. 8.3.1 for LSTM layers.

Listing 10.8 shows the LSTM architecture used. On line 9 we set the variable return\_sequences to true, which implies that all intermediate steps $z_t^{[1]}$, $1 \le t \le T$, are outputted to a time-distributed FN layer on line 10, see Sect. 8.2.4 for time-distributed layers. This LSTM network has $r = 4\,(50 + 1 + 10)\,10 + (10 + 1)\,10 + (90 + 1)\,9 = 3'369$ parameters. The flatten layer on line 11 of Listing 10.8 turns the $T = 9$ outputs $z_t^{[2]} \in \mathbb{R}^{q_2}$, $1 \le t \le T$, of dimension $q_2 = 10$ into a vector of size $T q_2 = 90$. This vector is then fed into the output layer on line 12. At this stage, one could reduce the number of parameters by setting a max-pooling layer in between the flatten and the output layer.
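The parameter counts quoted in this section follow a simple pattern: each dense layer has (inputs + 1) × units weights, and an LSTM cell carries 4 gates. The following sketch (our own, with the layer sizes of the listings hard-coded as defaults) reproduces the counts of the trainable parameters:

```python
def fn_params(T, b, q1=20, q2=15, K=9):
    """FN classifier of Listing 10.7: flattened T*b input, two hidden layers,
    softmax output with K labels."""
    return (T * b + 1) * q1 + (q1 + 1) * q2 + (q2 + 1) * K

def lstm_params(b, T=9, q1=10, q2=10, K=9):
    """LSTM classifier of Listing 10.8: 4 gates on (input + bias + recurrent)
    weights, a time-distributed FN layer, and the softmax output."""
    return 4 * (b + 1 + q1) * q1 + (q1 + 1) * q2 + (T * q2 + 1) * K
```

For instance, `fn_params(9, 2)` gives the 839 trainable parameters of the first analysis, and `lstm_params(50)` gives the 3'369 parameters of the LSTM network.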

**Fig. 10.6** Confusion matrices of the hazard type prediction using the pre-trained GloVe with *b* = 50 (lhs) FN network and (rhs) LSTM network; columns show the observations and rows show the predictions

**Listing 10.8** R code for the hazard type prediction using a LSTM architecture

```
1 input = layer_input(shape = list(T), name = "input")
2 #
3 # keep the sequence structure (T x b) for the LSTM, i.e., no flattening here
4 word2vec = input %>%
5 layer_embedding(input_dim = W+1, output_dim = b, input_length = T,
6 weights=list(wordEmb), trainable=FALSE)
7 #
8 response = word2vec %>%
9 layer_lstm(units=10, activation='tanh', return_sequences=TRUE, name='LSTM') %>%
10 time_distributed(layer_dense(units=10, activation='tanh', name='FNLayer')) %>%
11 layer_flatten() %>%
12 layer_dense(units=9, activation='softmax', name='output')
13 #
14 model = keras_model(inputs = c(input), outputs = c(response))
```
We fit this LSTM architecture to the data using the pre-trained GloVe embeddings. The results are presented in Fig. 10.6 (rhs) and Table 10.1. We receive the same deviance loss, and the misclassification rate is slightly worse than in the FN network case (with the same pre-trained GloVe embeddings). Note that the deviance loss is calculated on the estimated classification probabilities $\widehat{p}(\boldsymbol{x}) = (\widehat{p}_1(\boldsymbol{x}), \dots, \widehat{p}_9(\boldsymbol{x}))^\top$, and the labels are received by

$$\widehat{Y} = \widehat{Y}(\boldsymbol{x}) = \underset{k=1,\dots,9}{\arg\max}\; \widehat{p}_k(\boldsymbol{x}).$$

Thus, it may happen that the improvements on the estimated probabilities are not fully reflected on the predicted labels.

**Word (Cosine) Similarity** In our final analysis we work with the pre-trained GloVe embeddings $\boldsymbol{e}(w) \in \mathbb{R}^{50}$, but we first try to reduce the embedding dimension $b$. For this we follow Lee et al. [236], and we consider a *word similarity*. We can define the similarity of the words $w$ and $w' \in \mathcal{W}$ by considering the scalar product of their embeddings

$$\text{sim}^{(u)}(w, w') = \left\langle \boldsymbol{e}(w), \boldsymbol{e}(w') \right\rangle \qquad \text{or} \qquad \text{sim}^{(n)}(w, w') = \frac{\left\langle \boldsymbol{e}(w), \boldsymbol{e}(w') \right\rangle}{\|\boldsymbol{e}(w)\| \, \|\boldsymbol{e}(w')\|}. \tag{10.11}$$

The first one is an unweighted version, and the second one is a normalized version scaling with the corresponding Euclidean norms, so that the similarity measure lies within $[-1, 1]$. In fact, the latter is also called cosine similarity. To reduce the embedding dimension, and because we have a classification problem with hazard names, we can evaluate the (cosine) similarity of all used words $w \in \mathcal{W}$ to the hazards $h \in \mathcal{H} = \{\texttt{fire}, \texttt{lightning}, \texttt{hail}, \texttt{wind}, \texttt{water}, \texttt{vehicle}, \texttt{vandalism}\}$. Observe that water is further separated into weather related and non-weather related claims, and there is a further hazard type called misc, which collects all the rest. We could choose more words in $\mathcal{H}$ to describe these water and other claims more precisely. If we just use $\mathcal{H}$ we obtain a $b = |\mathcal{H}| = 7$ dimensional embedding mapping

$$w \in \mathcal{W}_0 \mapsto \boldsymbol{e}^{(a)}(w) = \left( \text{sim}^{(a)}(w, \texttt{fire}), \dots, \text{sim}^{(a)}(w, \texttt{vandalism}) \right)^{\top} \in \mathbb{R}^{b=7},\tag{10.12}$$

for $a \in \{u, n\}$. This gives us for every text $\texttt{txt} = (w_1,\dots,w_T)^\top \in \mathcal{W}_0^T$ the pre-processed features

$$\texttt{txt} \mapsto \left( \boldsymbol{e}^{(a)}(w_1), \dots, \boldsymbol{e}^{(a)}(w_T) \right)^{\top} \in \mathbb{R}^{T \times b}. \tag{10.13}$$

Lee et al. [236] apply a max-pooling layer to these embeddings, which are then inputted into a GAM classification model. We use a different approach here, and directly use the unweighted ($a = u$) text representations (10.13) as input to a network, either of the FN network type of Listing 10.7 or of the LSTM type of Listing 10.8. If we use the FN network type we receive the results on the last line of Table 10.1 and in Fig. 10.7.
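The two similarity measures (10.11) and the 7-dimensional representation (10.12) can be sketched as follows; the vectors below are random stand-ins for the pre-trained glove.6B.50d vectors (hypothetical data, our own illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
hazards = ['fire', 'lightning', 'hail', 'wind', 'water', 'vehicle', 'vandalism']
vocab = hazards + ['truck', 'smoke', 'graffito']
emb = {w: rng.normal(size=50) for w in vocab}   # stand-in for GloVe vectors

def sim_u(w, v):
    """Unweighted similarity: plain scalar product of the embeddings."""
    return emb[w] @ emb[v]

def sim_n(w, v):
    """Cosine similarity: normalized by the Euclidean norms, lies in [-1, 1]."""
    return sim_u(w, v) / (np.linalg.norm(emb[w]) * np.linalg.norm(emb[v]))

def e7(w, sim=sim_u):
    """7-dimensional representation (10.12): similarities to the hazard words."""
    return np.array([sim(w, h) for h in hazards])
```

Stacking `e7(w)` over the $T$ words of a (padded) text yields the $T \times 7$ feature representation (10.13).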

Comparing the results of the word similarity through the embeddings (10.12) and (10.13) to the other prediction results, we conclude that this word similarity approach is not fully competitive with working directly on the word2vec or GloVe embeddings. It seems that the projection (10.12) does not discriminate sufficiently for our classification task.
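Both similarity measures in (10.11) take only a few lines of code. The sketch below is in Python (the book works in R), and the three-dimensional embedding vectors are invented purely for illustration.

```python
import math

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def sim_u(u, v):
    # unweighted similarity: plain inner product <e(w), e(w')>
    return dot(u, v)

def sim_n(u, v):
    # normalized (cosine) similarity, always within [-1, 1]
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# toy 3-dimensional embeddings, invented for illustration
e = {
    "water": [0.9, 0.1, 0.0],
    "flood": [0.8, 0.2, 0.1],
    "fire":  [0.0, 0.9, 0.3],
}

print(round(sim_n(e["water"], e["flood"]), 3))  # close to 1: similar words
print(round(sim_n(e["water"], e["fire"]), 3))   # close to 0: dissimilar words
```

Note that the cosine similarity of a word with itself is always 1, whereas the unweighted version also depends on the length of the embedding vectors.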

## **10.4 Lab: Deep Word Representation Learning**

All examples above have relied on embedding the words $w \in \mathcal{W}$ into a Euclidean space, $\mathbf{e}(w) \in \mathbb{R}^b$, by performing a sort of unsupervised learning that provides word similarity clusters. The advantage of this approach is that the embedding is decoupled from the regression or classification task, which is computationally attractive. Moreover, once a suitable embedding has been learned, it can be used for several different tasks (in the spirit of transfer learning). The disadvantage of pre-trained embeddings is that the embedding is not targeted to the regression task at hand. This has already been discussed in Remark 10.4, where we have highlighted that the meaning of some words (such as Lincoln) depends very much on their context.

Recent NLP aims at pre-processing a text as little as necessary, and instead tries to directly feed the raw sentences into RN networks such as LSTM or GRU architectures. Computationally this is much more demanding because we have to learn the embeddings and the network weights simultaneously; Table 10.1 indicates the number of parameters involved. The purpose of this short section is to give an example, though our NLP database is rather small; this latter approach usually requires a huge database and the corresponding computational power. Ferrario–Nägelin [126] provide a more comprehensive example on the classification of movie reviews. For their analysis they evaluated approximately 50'000 movie reviews, each using between 235 and 2'498 words. Their analysis was implemented on the ETH High Performance Computing (HPC) infrastructure Euler,<sup>8</sup> and their run times have been between 20 and 30 minutes, see Table 8 of Ferrario–Nägelin [126].

<sup>8</sup> https://scicomp.ethz.ch/wiki/Euler

Since we have neither the computational power nor the big data to fit such an NLP application, we start the gradient descent fitting from the initial embedding weights $\mathbf{e}(w) \in \mathbb{R}^b$ that come either from the word2vec or the GloVe embeddings. During the gradient descent fitting, we allow these weights to change w.r.t. the regression task at hand. In comparison to Sect. 10.3, this only requires minor changes to the R code, namely, the only modification needed is to change FALSE to TRUE on line 5 of Listings 10.7 and 10.8. This change allows us to learn adapted weights during the gradient descent fitting. The resulting classification models are now very high-dimensional, and we need to carefully assess the early stopping rule, otherwise the model will (in-sample) over-fit to the learning data.

In Fig. 10.8 we provide the results that correspond to the self-trained word2vec embeddings given in Fig. 10.5, and the corresponding numerical results are given in Table 10.2. We observe an improvement in the prediction accuracy in both cases by letting the embedding weights be learned during the network fitting, and we receive misclassification rates of 11.6% and 11.0% for the embedding dimensions $b = 2$ and $b = 10$, respectively, see Table 10.2.

Figure 10.8 (rhs) illustrates how the embeddings have changed from the initial (pre-trained) embeddings $\mathbf{e}^{(0)}(w)$ (coming from the word2vec negative sampling) to the learned embeddings $\widehat{\mathbf{e}}(w)$. We measure these changes in terms of the unweighted similarity measure defined in (10.11), given by

$$\left\langle \mathfrak{e}^{(0)}(w), \widehat{\mathfrak{e}}(w) \right\rangle. \tag{10.14}$$

The upper horizontal line is a manually set threshold to identify the words *w* that experience a major change in their embeddings. These are the words 'vandalism', 'lightning', 'grafito', 'fence', 'hail', 'freeze', 'blow' and 'breakage'. Thus, these words receive a different embedding location/meaning which is more favorable for our classification task.
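The change measure (10.14) is easy to sketch. In the toy Python example below, both the two-dimensional embeddings and the threshold are invented; for embeddings of comparable norm, a small (or negative) inner product between the initial and the fine-tuned embedding of a word signals a large move of its location.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

# invented initial embeddings e0 and fine-tuned embeddings e1
e0 = {"water": [0.6, 0.8], "hail": [0.9, -0.1], "roof": [0.5, 0.5]}
e1 = {"water": [0.6, 0.8], "hail": [-0.2, 0.9], "roof": [0.5, 0.4]}

# change measure (10.14): <e0(w), e1(w)> per word
change = {w: dot(e0[w], e1[w]) for w in e0}
threshold = 0.2  # set by hand, like the threshold line in the text
moved = sorted(w for w in change if change[w] < threshold)
print(moved)  # words whose embedding location changed substantially
```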

A similar analysis can be performed for the pre-trained GloVe embeddings. There we expect bigger changes to the embeddings, since the GloVe embeddings have not been learned in an insurance context, and the embeddings will be adapted to the insurance prediction problem. We refrain from giving an explicit analysis here, because a thorough analysis would need (much) more data.

We conclude this example with some remarks. We emphasize once more that our available data is minimal, and we expect (even much) better results for longer claim descriptions. In particular, our data is not sufficient to discriminate the weather-related from the non-weather-related water claims, as the claim descriptions seem to focus on the water claim itself and not on its cause. In a next step, one should use claim descriptions to predict the claim sizes, or to improve predictions that are based only on classical tabular features. Here, we see some potential, in particular, w.r.t. medical claims, as medical reports may clearly indicate the severity of the claim, and these reports may give some insight into the recovery process. Thus, our small example may only give some intuition of what is possible with

**Fig. 10.8** Confusion matrices and the changes in the embeddings compared to the pre-trained word2vec embeddings of Fig. 10.5 for the dimensions *b* = 2 and *b* = 10

**Table 10.2** Hazard prediction results summarized in deviance losses and misclassification rates: pre-trained embeddings vs. network learned embeddings



(unstructured) text data. Unfortunately, the LGPIF data of Listing 10.1 did not give us any satisfactory results for the claim size prediction, for several reasons. Firstly, the data is rather heterogeneous, ranging from small to very large claims, and any member of the EDF struggles to model this data; we come back to a different modeling proposal for heterogeneous data in Sect. 11.3.2. Secondly, the claim descriptions are not very explanatory, as they are too short to provide more detailed information. Thirdly, the data has only 5'424 claims, which seems small compared to the complexity of the problem that we try to solve.

## **10.5 Outlook: Creating Attention**

In text recognition problems, obviously, not all the words in a sentence have the same importance. In the examples above, we have removed the stopwords as they may disturb the key understanding of our texts. Removing the stopwords means that we pay more attention to the remaining words. RN networks often face difficulty in recognizing the importance of the different parts of a sentence. For this reason, *attention layers* have recently gained popularity. Attention layers are special modules in network architectures that allow the network to impose more weight on certain parts of the information in the features to emphasize their importance. The attention mechanism has been introduced in Bahdanau et al. [21]. There are different ways of modeling attention; the most popular one is the so-called *dot-product attention*, we refer to Vaswani et al. [366], and in the actuarial literature we mention Kuo–Richman [231] and Troxler–Schelldorfer [354].

We start by describing a simple attention mechanism. Consider a sentence $\texttt{text} = (w_1, \ldots, w_T) \in \mathcal{W}_0^T$ that provides, under an embedding map $\mathbf{e}: \mathcal{W}_0 \to \mathbb{R}^b$, the embedded sentence $(\mathbf{e}(w_1), \ldots, \mathbf{e}(w_T))^\top \in \mathbb{R}^{T \times b}$. We choose a weight matrix $U_Q \in \mathbb{R}^{b \times b}$ and an intercept vector $u_Q \in \mathbb{R}^b$. Based on these choices we consider for each word $w_t$ of our sentence the score, called *query*,

$$q_t = \tanh\left(u_Q + U_Q\, \mathbf{e}(w_t)\right) \ \in \ (-1, 1)^{b}. \tag{10.15}$$

Matrix $Q = (q_1, \ldots, q_T)^\top \in \mathbb{R}^{T \times b}$ collects all queries. It is obtained by applying a time-distributed FN layer with $b$ neurons to the embedded sentence $(\mathbf{e}(w_1), \ldots, \mathbf{e}(w_T))^\top$.

These queries $q_t$ are evaluated with a so-called *key* $k \in \mathbb{R}^b$, giving us the *attention weights*

$$\alpha_t = \frac{\exp\left\langle k, q_t\right\rangle}{\sum_{s=1}^{T} \exp\left\langle k, q_s\right\rangle} \in (0, 1) \qquad \text{for } 1 \le t \le T. \tag{10.16}$$

Using these attention weights $\alpha = (\alpha_1, \ldots, \alpha_T)^\top \in (0, 1)^T$ we encode the sentence text as

$$\begin{aligned} \texttt{text} = (w_1, \ldots, w_T) \mapsto w^* &= \sum_{t=1}^T \alpha_t\, \mathbf{e}(w_t) \\ &= \left(\mathbf{e}(w_1), \ldots, \mathbf{e}(w_T)\right) \alpha \ \in \mathbb{R}^b. \end{aligned} \tag{10.17}$$

Thus, to every sentence text we assign a categorical probability vector $\alpha = \alpha(\texttt{text}) \in \Delta_T$, see Sect. 2.1.4, (6.22) and (5.69), which encodes this sentence text as a $b$-dimensional vector $w^* \in \mathbb{R}^b$. This vector is then further processed by the network. Such a construction is called a *self-attention mechanism* because the text $(w_1, \ldots, w_T) \in \mathcal{W}_0^T$ is used to formulate the queries in (10.15), but, of course, these queries could also come from a completely different source. In the above set-up we have to learn the parameters $U_Q \in \mathbb{R}^{b \times b}$ and $u_Q, k \in \mathbb{R}^b$, assuming that the embedding map $\mathbf{e}: \mathcal{W}_0 \to \mathbb{R}^b$ has already been specified.
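The three steps (10.15)–(10.17) can be sketched numerically in a few lines. The Python sketch below replaces the learned parameters $u_Q$, $U_Q$ and $k$ by random toy values, and the embedded sentence is random as well; it only illustrates the mechanics, not a fitted model.

```python
import math
import random

random.seed(0)
b, T = 4, 5  # embedding dimension and sentence length (toy sizes)

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

# toy embedded sentence e(w_1),...,e(w_T) and randomly initialized
# parameters u_Q, U_Q, k; in the text these parameters are learned
E   = [[random.gauss(0, 1) for _ in range(b)] for _ in range(T)]
u_Q = [random.gauss(0, 1) for _ in range(b)]
U_Q = [[random.gauss(0, 1) for _ in range(b)] for _ in range(b)]
k   = [random.gauss(0, 1) for _ in range(b)]

# queries (10.15): q_t = tanh(u_Q + U_Q e(w_t)), componentwise in (-1, 1)
Q = [[math.tanh(u_Q[i] + dot(U_Q[i], e)) for i in range(b)] for e in E]

# attention weights (10.16): softmax over t of <k, q_t>
expo  = [math.exp(dot(k, q)) for q in Q]
alpha = [x / sum(expo) for x in expo]

# encoding (10.17): w* = sum_t alpha_t e(w_t) in R^b
w_star = [sum(alpha[t] * E[t][i] for t in range(T)) for i in range(b)]
print([round(a, 3) for a in alpha])
```

The attention weights form a categorical probability vector, so the encoding $w^*$ is a convex combination of the word embeddings.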

There are several generalizations and modifications of this self-attention mechanism. The most common one is to expand the vector $w^* \in \mathbb{R}^b$ in (10.17) to a matrix $W^* = (w^*_1, \ldots, w^*_q) \in \mathbb{R}^{b \times q}$. This matrix $W^*$ can be interpreted as having $q$ neurons $w^*_j \in \mathbb{R}^b$, $1 \le j \le q$. For this, one replaces the key $k \in \mathbb{R}^b$ by a matrix-valued key $K = (k_1, \ldots, k_q) \in \mathbb{R}^{b \times q}$. This allows one to calculate the attention weight matrix

$$\begin{aligned} A = (\alpha_{t,j})_{1 \le t \le T,\, 1 \le j \le q} &= \left( \frac{\exp \langle k_j, q_t \rangle}{\sum_{s=1}^T \exp \langle k_j, q_s \rangle} \right)_{1 \le t \le T,\, 1 \le j \le q} \\ &= \text{softmax} \left( QK \right) \ \in \ (0, 1)^{T \times q}, \end{aligned}$$

where the softmax function is applied column-wise. I.e., the attention weight matrix $A \in (0, 1)^{T \times q}$ has columns $\alpha_j = (\alpha_{1,j}, \ldots, \alpha_{T,j})^\top \in \Delta_T$, $1 \le j \le q$, which are normalized to total weight 1; this is equivalent to (10.16). This is used to encode the sentence text

$$\begin{aligned} \left(\mathbf{e}(w_1), \ldots, \mathbf{e}(w_T)\right) \in \mathbb{R}^{b \times T} \mapsto \ W^* &= \left(\mathbf{e}(w_1), \ldots, \mathbf{e}(w_T)\right) A \\ &= \left( \sum_{t=1}^T \alpha_{t,j}\, \mathbf{e}(w_t) \right)_{1 \le j \le q} \in \mathbb{R}^{b \times q}. \end{aligned} \tag{10.18}$$

Mapping (10.18) is called an *attention layer*. Let us give some remarks.

#### *Remarks 10.5*

• Encoding (10.18) gives a natural multi-dimensional extension of (10.17). The crucial parts are the attention weights $\alpha_j \in \Delta_T$ which weigh the different words $(w_t)_{1 \le t \le T}$. In the multi-dimensional case, we perform this weighting mechanism multiple times (in different directions), allowing us to extract different features from the sentences; in contrast, in (10.17) we only do this once. This is similar to going from one neuron to a layer of $q$ neurons.


• Choosing $q = T$ in (10.18), the attention output $W^*$ has the same dimensions as the input, and the two can be combined by a skip connection

$$W = (\mathbf{e}(w_1), \ldots, \mathbf{e}(w_T)) \in \mathbb{R}^{b \times T} \mapsto \frac{W + W^*}{2} \in \mathbb{R}^{b \times T}.\tag{10.19}$$

Stacking multiple of these layers (10.19) transforms the original input $W$ by weighing the important information in the feature $W$ for the prediction task at hand. Compared to LSTM layers, this no longer screens the text sequentially, but directly acts on the parts of the text that seem important.

• The attention mechanism is applied to a matrix $(\mathbf{e}(w_1), \ldots, \mathbf{e}(w_T))^\top \in \mathbb{R}^{T \times b}$ which presents a numerical encoding of the sentence $(w_1, \ldots, w_T)^\top \in \mathcal{W}_0^T$. Kuo–Richman [231] propose to apply this attention mechanism more generally to categorical feature components. Assume that we have $T$ categorical feature components $x_1, \ldots, x_T$; after embedding them into $b$-dimensional Euclidean spaces we receive a representation $(\mathbf{e}(x_1), \ldots, \mathbf{e}(x_T))^\top \in \mathbb{R}^{T \times b}$, see (7.31). Naturally, this can now be further processed by putting different attention on the components of this embedding, using an attention layer (10.18); alternatively, we can use transformer layers (10.19).
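The attention layer (10.18) with a matrix-valued key can be sketched as follows. For simplicity, this Python toy example takes the queries equal to the embeddings themselves, i.e., it skips the query transformation (10.15); this is a simplifying assumption for illustration only, and all numbers are invented.

```python
import math

def softmax(v):
    # numerically stable softmax of a list of scores
    m = max(v)
    ex = [math.exp(x - m) for x in v]
    s = sum(ex)
    return [x / s for x in ex]

def attention_layer(E, K):
    # E: embedded sentence as T vectors e(w_t) in R^b (given row-wise),
    # K: q key vectors k_j in R^b; returns W* as q columns in R^b, (10.18)
    T, b = len(E), len(E[0])
    cols = []
    for k in K:
        # column j of A: softmax over t of <k_j, q_t>, here with q_t = e(w_t)
        a = softmax([sum(ki * ei for ki, ei in zip(k, e)) for e in E])
        # w*_j = sum_t alpha_{t,j} e(w_t): a convex combination of the words
        cols.append([sum(a[t] * E[t][i] for t in range(T)) for i in range(b)])
    return cols

E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 words, b = 2
K = [[2.0, 0.0], [0.0, 2.0]]              # q = 2 keys
W_star = attention_layer(E, K)
print([[round(x, 3) for x in col] for col in W_star])
```

Each key extracts a differently weighted summary of the same sentence: the first key attends to words with a large first embedding coordinate, the second key to the second coordinate.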

*Example 10.6* We revisit the hazard type prediction example of Sect. 10.3. We select the $b = 10$ word2vec embedding (using negative sampling) and the pre-trained GloVe embedding of Table 10.1. These embeddings are then further processed by applying the attention mechanism (10.15)–(10.17) to the embeddings, using one single attention neuron. Listing 10.9 gives the corresponding implementation. On line 9 we have the query (10.15), on lines 10–13 the key and the attention weights (10.16), and on line 16 the encoding (10.17). We then process these encodings through a FN network of depth $d = 2$, and we use the softmax output activation to receive the categorical probabilities. Note that we keep the learned word embeddings $\mathbf{e}(w)$ non-trainable on line 5 of Listing 10.9.

Table 10.3 gives the results, and Fig. 10.9 shows the confusion matrices. We conclude that the results are rather similar; this attention mechanism seems to work quite well here, with fewer parameters.

**Listing 10.9** R code for the hazard type prediction using an attention layer with $q = 1$

```
1 input = layer_input(shape = list(T), name = "input")
2 #
3 word2vec = input %>%
4 layer_embedding(input_dim = W+1, output_dim = b, input_length = T,
5 weights=list(wordEmb), trainable=FALSE) %>%
6 layer_flatten()
7 #
8 attention = word2vec %>%
9 time_distributed(layer_dense(units=b, activation='tanh')) %>%
10 time_distributed(layer_dense(units=1, activation='linear',
11 use_bias=FALSE)) %>%
12 layer_flatten() %>%
13 layer_dense(units=T, activation='softmax', weights=list(diag(T)),
14 use_bias=FALSE, trainable=FALSE)
15 #
16 response = list(attention, word2vec) %>% layer_dot(axes=1) %>%
17 layer_dense(units=20, activation='tanh') %>%
18 layer_dense(units=15, activation='tanh') %>%
19 layer_dense(units=9, activation='softmax')
20 #
21 model = keras_model(inputs = c(input), outputs = c(response))
```
**Table 10.3** Hazard prediction results summarized in deviance losses and misclassification rates


**Fig. 10.9** Confusion matrices of the hazard type prediction (lhs) using an attention layer on the word2vec embeddings with *b* = 10, and (rhs) using an attention layer on the pre-trained GloVe embeddings with *b* = 50; columns show the observations and rows show the predictions

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 11 Selected Topics in Deep Learning**

## **11.1 Deep Learning Under Model Uncertainty**

We revisit claim size modeling in this section. Claim size modeling is challenging because often there is no (simple) off-the-shelf distribution that allows one to appropriately describe all claim size observations. E.g., the main body of the claim size data may look gamma distributed and, at the same time, large claims seem to be more heavy-tailed (contradicting a gamma model assumption). Moreover, different product and claim types may lead to multi-modality in the claim size densities. In Sects. 5.3.7 and 5.3.8 we have explored a gamma and an inverse Gaussian GLM to model a motorcycle claims data set. In that example, the results have been satisfactory because this motorcycle data is neither multi-modal nor heavy-tailed. These two GLM approaches have been based on the EDF (2.14), modeling the mean $x \mapsto \mu(x)$ with a regression function and assuming a constant dispersion parameter $\varphi > 0$. There are two natural ways to extend this approach. One considers a double GLM with a dispersion submodel $x \mapsto \varphi(x)$, see Sect. 5.5; the other explores multi-parameter extensions like the generalized inverse Gaussian model, which is a $k = 3$ vector-valued EF, see (2.10), or the GB2 family that involves 4 parameters, see (5.79). These extensions add complexity, also in the MLE. In this section, we are not going to consider multi-parameter extensions, but in a first step we aim at robustifying (mean) parameter estimation within the EDF. In a second step we are going to analyze the resulting dispersion $\varphi(x)$. For these steps, we perform representation learning and parameter estimation under model uncertainty by simultaneously considering multiple models from Tweedie's family. These considerations are closely related to Tweedie's forecast dominance given in Definition 4.22.

We emphasize that we remain within a single distribution function choice in this section, i.e., we consider neither mixture distributions nor composite models. Mixture density networks are going to be considered in Sect. 11.6, below, and a composite model approach is studied in Sect. 11.3. These mixture density networks and composite models allow us to model the body and the tail of the data with different distribution functions by either mixing or concatenating suitable distributions.

## *11.1.1 Recap: Tweedie's Family*

Tweedie's family with power variance function $V(\mu) = \mu^p$, $p \ge 2$, provides us with a rich model class for claim size modeling if the claim sizes are strictly positive, a.s., and extending to $p \in (1, 2)$ allows us to model claims with a positive point mass in 0. This class of distribution functions contains the gamma case ($p = 2$) and the inverse Gaussian case ($p = 3$). In general, $p > 2$ provides us with positive stable generated distributions and $p \in (1, 2)$ gives Tweedie's CP models, see Table 2.1. Tweedie's family has cumulant function for $p > 1$

$$\kappa(\theta) = \kappa_p(\theta) = \begin{cases} \frac{1}{2-p} \left( (1-p)\theta \right)^{\frac{2-p}{1-p}} & \text{for } p > 1 \text{ and } p \neq 2, \\ -\log(-\theta) & \text{for } p = 2, \end{cases} \tag{11.1}$$

on the effective domain $\theta \in \boldsymbol{\Theta} = (-\infty, 0)$ for $p \in (1, 2]$, and $\theta \in \boldsymbol{\Theta} = (-\infty, 0]$ for $p > 2$. The mean and the power variance function are for $p > 1$ given by

$$\theta \mapsto \mu = \mu(\theta) = ((1-p)\theta)^{\frac{1}{1-p}} \qquad \text{and} \qquad \mu \mapsto V(\mu) = \mu^p.$$

The unit deviance takes the following form for $p > 1$ and $p \neq 2$, see (4.18),

$$\mathfrak{d}\_{p}(\mathbf{y},\mu) = 2\left(\mathbf{y}\frac{\mathbf{y}^{1-p} - \mu^{1-p}}{1-p} - \frac{\mathbf{y}^{2-p} - \mu^{2-p}}{2-p}\right) \tag{11.2}$$

and in the gamma case *p* = 2 we have, see Table 4.1,

$$\mathfrak{d}\_2(\mathbf{y}, \mu) = 2\left(\frac{\mathbf{y}}{\mu} - 1 + \log\left(\frac{\mu}{\mathbf{y}}\right)\right) \ge 0. \tag{11.3}$$

Figure 11.1 (lhs) shows the unit deviances $y \mapsto \mathfrak{d}_p(y, \mu)$ for fixed mean parameter $\mu = 2$ and power variance parameters $p \in \{0, 2, 2.5, 3, 3.5\}$; the case $p = 0$ corresponds to the symmetric Gaussian case $\mathfrak{d}_0(y, \mu) = (y - \mu)^2$. We observe that with an increasing power variance parameter $p$, large claims $Y = y$ receive a smaller loss punishment (if we interpret the unit deviance as a loss function). This is the situation where we have a fixed mean $\mu$ and where we assess claim sizes

**Fig. 11.1** (lhs) Unit deviances $y \mapsto \mathfrak{d}_p(y, \mu) \ge 0$ for fixed mean $\mu = 2$ and (rhs) unit deviances $\mu \mapsto \mathfrak{d}_p(y, \mu) \ge 0$ for fixed observation $y = 2$, for power variance parameters $p \in \{0, 2, 2.5, 3, 3.5\}$

$Y = y$ relative to this mean. For estimation purposes we have fixed observations $Y = y$ and we study the sensitivities in $\mu$. Note that, in general, the unit deviances $\mathfrak{d}_p(y, \mu)$ are not symmetric in $y$ and $\mu$. This second case is shown in Fig. 11.1 (rhs), and the general behavior in $p$ is similar. As a result, by selecting different hyperparameters $p > 1$, we can control the influence of large (and small) claims on parameter estimation, because the unit deviances $\mathfrak{d}_p(y, \cdot)$ have different slopes for different $p$'s. Basically, the choice of the loss function (unit deviance) determines the choice of the underlying distributional model, which then assesses the claim observations $Y = y$ according to their sizes and how these sizes match the model assumptions made.
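The unit deviances (11.2)–(11.3) are straightforward to evaluate. The Python sketch below (the book works in R) illustrates the claim made above: for a large claim and a fixed small mean, the loss punishment decreases with increasing power variance parameter $p$; the claim size and mean are toy numbers.

```python
import math

def unit_deviance(y, mu, p):
    # Tweedie unit deviance d_p(y, mu) for p > 1, see (11.2) and (11.3)
    if p == 2:  # gamma case
        return 2 * (y / mu - 1 + math.log(mu / y))
    return 2 * (y * (y**(1 - p) - mu**(1 - p)) / (1 - p)
                - (y**(2 - p) - mu**(2 - p)) / (2 - p))

# a large claim y = 100 under the fixed mean mu = 2 of Fig. 11.1 (lhs):
# the loss punishment decreases with increasing power variance parameter p
for p in [2.0, 2.5, 3.0]:
    print(p, round(unit_deviance(100.0, 2.0, p), 2))
```

For the inverse Gaussian case $p = 3$, the formula collapses to $(y - \mu)^2/(y\mu^2)$, which can be used as a sanity check of the implementation.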

In Lemma 2.22 we have seen that the unit deviances $\mathfrak{d}_p(y, \mu) \ge 0$ are zero if and only if $y = \mu$. The second derivatives given in Lemma 2.22 allow us to consider a second order Taylor expansion around a minimum $\mu_0 = y_0$

$$\mathfrak{d}_p\left(y_0 + \epsilon y, \mu_0 + \epsilon \mu\right) = \frac{\epsilon^2}{\mu_0^p}\left(y - \mu\right)^2 + o(\epsilon^2) \qquad \text{as } \epsilon \to 0.$$

Thus, locally around the minimum, the unit deviances behave symmetrically and like Gaussian squares, but this is only a local approximation around a minimum $\mu_0 = y_0$, as can be seen from Fig. 11.1. I.e., in general, model fitting turns out to be rather different from fitting under the Gaussian square loss if we have small and large claim sizes under choices $p > 1$.


#### *Remarks 11.1*


• For every $\lambda > 0$, Tweedie's unit deviances satisfy the homogeneity property

$$
\mathfrak{d}_p(\lambda y, \lambda \mu) = \lambda^{2-p}\, \mathfrak{d}_p(y, \mu). \tag{11.4}
$$

This influences the dispersion estimation in the cases different from the gamma case $p = 2$, see, e.g., the saddlepoint approximation (5.60)–(5.62). This also relates to the different parametrizations in Sect. 5.3.8, where we study the inverse Gaussian model $p = 3$, which has a dispersion $\varphi_i = 1/\alpha_i$ in the reproductive form and $\varphi_i = 1/\alpha_i^2$ in parametrization (5.51).

• We only consider power variance parameters *p >* 1 in this section for nonnegative claim size modeling. Technically, this analysis could be extended to *p* ∈ {0*,* 1}. We do not consider the Gaussian case *p* = 0 to exclude negative claims, and we do not consider the Poisson case *p* = 1 because this is used for claim counts modeling.

We recall that unit deviances of the EDF are equal to twice the corresponding KL divergences, which in turn are special cases of Bregman divergences. From Theorem 4.19 we know that Bregman divergences *Dψ* are the only strictly consistent loss/scoring functions for mean estimation.

**Lemma 11.2** *Choose $p > 1$. The scaled unit deviance $\mathfrak{d}_p(y, \mu)/2$ is a Bregman divergence $D_{\psi_p}(y, \mu)$ on $\mathbb{R}_+ \times \mathbb{R}_+$ with the strictly decreasing and strictly convex function on $\mathbb{R}_+$*

$$
\psi_p(y) = y\, h_p(y) - \kappa_p\left(h_p(y)\right) = \begin{cases}
\frac{1}{(2-p)(1-p)}\, y^{2-p} & \text{for } p > 1 \text{ and } p \neq 2, \\
-\log(y) - 1 & \text{for } p = 2,
\end{cases}
$$

*for the canonical link $h_p(y) = (\kappa_p')^{-1}(y) = y^{1-p}/(1-p)$.*

*Proof of Lemma 11.2* The Bregman divergence property follows from (2.29). For *p >* 1 and *y >* 0 we have the strictly decreasing property

$$
\psi\_p'(\mathbf{y}) = h\_p(\mathbf{y}) = \mathbf{y}^{1-p}/(1-p) < 0.
$$

The second derivative is $\psi_p''(y) = h_p'(y) = y^{-p} = 1/V(y) > 0$, which provides the strict convexity.

In the Gaussian case we have $\psi_0(y) = y^2/2$, and $\psi_0'(y) > 0$ on $\mathbb{R}_+$ implies that this is a strictly increasing convex function for positive claims $y > 0$. This is different from Lemma 11.2.
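The derivative identities used in the proof of Lemma 11.2 can be checked numerically by finite differences. In the Python sketch below, the gamma case $p = 2$ is evaluated as $\psi_2(y) = -\log(y) - 1$, which is our own evaluation of $y\, h_2(y) - \kappa_2(h_2(y))$ and stated here as an assumption; the test point $y = 1.7$ is arbitrary.

```python
import math

def psi(y, p):
    # psi_p from Lemma 11.2; p = 2 evaluated as -log(y) - 1 (an assumption,
    # obtained from y h_2(y) - kappa_2(h_2(y)) with h_2(y) = -1/y)
    if p == 2:
        return -math.log(y) - 1.0
    return y**(2 - p) / ((2 - p) * (1 - p))

def d1(f, y, h=1e-6):
    # central first difference
    return (f(y + h) - f(y - h)) / (2 * h)

def d2(f, y, h=1e-4):
    # central second difference
    return (f(y + h) - 2 * f(y) + f(y - h)) / h**2

# check psi_p'(y) = h_p(y) = y^(1-p)/(1-p) < 0 and psi_p''(y) = y^(-p) = 1/V(y)
y = 1.7
for p in [2.0, 2.5, 3.0]:
    f = lambda z, p=p: psi(z, p)
    assert d1(f, y) < 0
    assert abs(d1(f, y) - y**(1 - p) / (1 - p)) < 1e-6
    assert abs(d2(f, y) - y**(-p)) < 1e-4
print("psi_p is strictly decreasing and strictly convex at y =", y)
```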

Assume we have independent observations $(Y_i, x_i)$ following the same Tweedie's distribution, with means given by $\mu_\vartheta(x_i)$ for some parameter $\vartheta$. The M-estimator of $\vartheta$ using this Bregman divergence is given by

$$\widehat{\vartheta} = \underset{\vartheta}{\arg\max}\; \ell_{\boldsymbol{Y}}(\vartheta) = \underset{\vartheta}{\arg\min} \sum_{i=1}^n \frac{v_i}{\varphi}\, D_{\psi_p}\left(Y_i, \mu_\vartheta(x_i)\right).$$

If we turn this M-estimator into a Z-estimator (provided we have differentiability), the parameter estimate $\widehat{\vartheta}$ is found as a solution of the score equations

$$\begin{split} 0 & \stackrel{!}{=} -\nabla_{\vartheta} \sum_{i=1}^{n} \frac{v_i}{\varphi}\, D_{\psi_p}\left( Y_i, \mu_\vartheta(x_i) \right) \\ &= \sum_{i=1}^{n} \frac{v_i}{\varphi}\, \psi_p''(\mu_\vartheta(x_i)) \left( Y_i - \mu_\vartheta(x_i) \right) \nabla_{\vartheta}\, \mu_\vartheta(x_i) \\ &= \sum_{i=1}^{n} \frac{v_i}{\varphi}\, \frac{Y_i - \mu_\vartheta(x_i)}{V(\mu_\vartheta(x_i))}\, \nabla_{\vartheta}\, \mu_\vartheta(x_i) \\ &= \sum_{i=1}^{n} \frac{v_i}{\varphi}\, \frac{Y_i - \mu_\vartheta(x_i)}{\mu_\vartheta(x_i)^p}\, \nabla_{\vartheta}\, \mu_\vartheta(x_i). \end{split} \tag{11.5}$$

In the GLM case this exactly corresponds to (5.9). To determine the Z-estimator from (11.5), we scale the residuals $Y_i - \mu_i$ inversely proportionally to the variances $V(\mu_i) = \mu_i^p$ of the chosen Tweedie's distribution. It is a well-known result that if we combine individual unbiased estimators with weights inversely proportional to their variances, we receive the unbiased estimator with minimal variance; we come back to this in (11.16), below. This gives us the intuition behind a specific choice of the power variance parameter for mean estimation, as the sizes of the variances $\mu_i^p$ scale (weight) the observed residuals $Y_i - \mu_i$, and balance potential outliers in the observations correspondingly.
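The minimal-variance property invoked here can be illustrated directly: for independent unbiased estimators with variances $\sigma_i^2$, the convex combination with weights proportional to $1/\sigma_i^2$ has the smallest variance, namely $1/\sum_i \sigma_i^{-2}$. A small Python sketch with toy variances:

```python
# variance of the convex combination sum_i (w_i / sum(w)) * Y_i of
# independent unbiased estimators Y_i with Var(Y_i) = variances[i]
def combo_variance(weights, variances):
    s = sum(weights)
    return sum((w / s) ** 2 * v for w, v in zip(weights, variances))

variances = [1.0, 4.0, 9.0]               # toy variances, e.g. V(mu_i) = mu_i^p
inv_var   = [1.0 / v for v in variances]  # weights proportional to 1/variance
equal     = [1.0, 1.0, 1.0]

opt = combo_variance(inv_var, variances)
print(opt)                               # equals 1 / sum(1/variances)
print(combo_variance(equal, variances))  # strictly larger
```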

## *11.1.2 Lab: Claim Size Modeling Under Model Uncertainty*

We present a proposal for deep learning under model uncertainty in this section. We explain this on an explicit example within Tweedie's distributions. We emphasize that this methodology can be applied in more generality, but it is beneficial here to have an explicit example in mind to illustrate the different phenomena.

#### **Generalized Linear Models**

We analyze a Swiss accident insurance claims data set. This data is illustrated in Sect. 13.4, and an excerpt of the data is given in Listing 13.7. In total we have 339'500 claims with positive payments. We choose this data set because it ranges from very small claims of 1 CHF to very large claims, the biggest one exceeding 1'300'000 CHF. These claims are supported by feature information such as the labor sector, the injury type or the injured body part, see Listing 13.7 and Fig. 13.25. For our analysis, we partition the data into a learning data set $\mathcal{L}$ and a test data set $\mathcal{T}$. We do this partition stratified w.r.t. the claim sizes and in a ratio of 9 : 1. This results in a learning data set $\mathcal{L}$ of size $n = 305'550$ and in a test data set $\mathcal{T}$ of size 33'950.

We consider three Tweedie's distributions with power variance parameters $p \in \{2, 2.5, 3\}$: the first one is the gamma model, the last one the inverse Gaussian model, and the power variance parameter $p = 2.5$ gives a model in between. In a first step we consider GLMs, which requires feature engineering. We have three categorical features, one binary feature and two continuous ones. For the categorical and binary features we use dummy coding, and the continuous features Age and AccQuart are included in their raw form. As link function $g$ we choose the log-link, which respects the positivity of the dual mean parameter space $\mathcal{M}$, see Table 2.1, but which is not the canonical link of the selected models. In the gamma GLM this leads to a convex minimization problem, but in Tweedie's GLM with $p = 2.5$


**Table 11.1** In-sample and out-of-sample losses (gamma loss, power variance case $p = 2.5$ loss (in $10^{-2}$) and inverse Gaussian (IG) loss (in $10^{-3}$)) and AIC values; the losses use unit dispersion $\varphi = 1$, AIC relies on the MLE of $\varphi$

and in the inverse Gaussian GLM we have non-convex minimization problems, see Example 5.6. Therefore, we initialize Fisher's scoring method (5.12) in the latter two GLMs with the solution of the gamma GLM. The gamma and the inverse Gaussian cases can directly be fitted with the R command glm [307]; for the power variance parameter case $p = 2.5$ we have coded our own MLE routine using Fisher's scoring method.

Table 11.1 shows the in-sample losses on the learning data $\mathcal{L}$ and the corresponding out-of-sample losses on the test data $\mathcal{T}$. The fitted GLMs (gamma, power variance parameter $p = 2.5$ and inverse Gaussian) are always evaluated on all three unit deviances $\mathfrak{d}_{p=2}(y, \mu)$, $\mathfrak{d}_{p=2.5}(y, \mu)$ and $\mathfrak{d}_{p=3}(y, \mu)$, respectively. We give some remarks. First, we observe that the in-sample loss is always minimized by the GLM with the same power variance parameter $p$ as the loss $\mathfrak{d}_p$ studied (2.0695, 7.6971 and 3.9398 in bold face). This result simply states that the parameter estimates are obtained by minimizing the in-sample loss (or maximizing the corresponding in-sample log-likelihood). Second, the minimal out-of-sample losses are also highlighted in bold face. From these results we cannot give any preference to a single model w.r.t. Tweedie's forecast dominance, see Definition 4.20. Third, we calculate the AIC values for all models. The gamma and the inverse Gaussian cases have a closed-form solution for the normalizing term $a(y; v/\varphi)$ in the EDF density, and we can directly calculate AIC. The case $p = 2.5$ is more difficult, and we use the saddlepoint approximation of Sect. 5.5.2. Considering AIC, we give preference to Tweedie's GLM with $p = 2.5$. Note that the AIC values use the MLE for $\varphi$, which is obtained from a general purpose optimizer, and which uses the saddlepoint approximation in the power variance case $p = 2.5$. Fourth, under a constant dispersion parameter $\varphi$, the mean estimation $\widehat{\mu}_i$ can be done without explicitly specifying $\varphi$, because it cancels in the score equations. In fact, we perform this mean estimation in the additive form and not in the reproductive form, see (2.13) and the discussions in Sects. 5.3.7–5.3.8.
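As a reminder of the criterion used in this comparison: AIC equals twice the number of estimated parameters minus twice the maximized log-likelihood, and the model with the smallest value is preferred. The Python sketch below uses invented log-likelihoods and parameter counts, not the numbers of Table 11.1.

```python
def aic(log_likelihood, num_params):
    # Akaike information criterion: 2 * (#parameters) - 2 * loglik;
    # smaller values indicate the preferred model
    return 2 * num_params - 2 * log_likelihood

# invented maximized log-likelihoods and parameter counts, illustration only
models = {"gamma": (-1000.0, 10), "p=2.5": (-998.0, 10), "IG": (-1005.0, 10)}
best = min(models, key=lambda m: aic(*models[m]))
print(best)  # the preferred model under AIC
```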

Figure 11.2 plots the deviance residuals (for unit dispersion) against the logged fitted means *μ̂(x_i)* for *p* ∈ {2, 2.5, 3} for 2'000 randomly selected claims; this is the Tukey–Anscombe plot. The green line has been obtained by a spline fit to the deviance residuals as a function of the fitted means *μ̂(x_i)*, and the cyan lines give twice the estimated standard deviation of the deviance residuals as a function of the fitted means (also obtained from spline fits). This estimated standard deviation corresponds to the square root of the deviance dispersion estimate *ϕ̂^D*, see (5.30), however, in the additive form because we work with unscaled claim size observations. A constant dispersion assumption is supported by cyan lines of roughly constant level. In the gamma case the dispersion seems to be increasing in the mean estimate, and in the inverse Gaussian case it is decreasing; thus, the power variance parameters *p* = 2 and *p* = 3 do not support a constant dispersion in this example. Only the choice *p* = 2.5 may support a constant dispersion assumption (because it does not show an obvious trend). This says that the variance should scale as *V(μ) = μ^{2.5}* as a function of the mean *μ*, see also (11.5).

**Fig. 11.2** Tukey–Anscombe plots showing the deviance residuals against the logged GLM fitted means *μ̂(x_i)*: (lhs) gamma GLM *p* = 2, (middle) power variance case *p* = 2.5, (rhs) inverse Gaussian GLM *p* = 3; the cyan lines show twice the estimated standard deviation of the deviance residuals as a function of the logged estimated means *μ̂*

#### **Deep FN Networks**

We compare the above GLMs to FN networks of depth *d* = 3 with *(q_1, q_2, q_3)* = *(20, 15, 10)* neurons. The categorical features are modeled with embedding layers of dimension *b* = 2. We fit this network architecture with Tweedie's deviance losses having power variance parameters *p* ∈ {2, 2.5, 3}. Moreover, we use 20% of the learning data *L* as validation data *V* for the early stopping rule.<sup>1</sup> To reduce the randomness coming from early stopping with different seeds, we average the deviance losses over 20 runs (this is not the nagging predictor: we only average the deviance losses to obtain stable conclusions concerning forecast dominance). The results are presented in Table 11.2.

<sup>1</sup> In the standard implementation of SGD with early stopping, the partition into learning and validation data is done in a non-stratified way. If necessary, this can be changed manually.

**Table 11.2** In-sample and out-of-sample losses (gamma loss, power variance case *p* = 2.5 loss (in 10^{−2}) and inverse Gaussian (IG) loss (in 10^{−3})) and average claim amounts; the losses use unit dispersion *ϕ* = 1 and the network losses are averaged deviance losses over 20 runs with different seeds


First, we observe that the networks outperform the GLMs, which says that the feature engineering has not been done optimally for the GLMs. Second, in-sample we no longer obtain the lowest deviance loss in the model with the same *p*. This comes from the fact that we exercise early stopping; for instance, the gamma in-sample loss of 1.9738 of the gamma network (*p* = 2) is bigger than the corresponding gamma loss of 1.9712 of the network with *p* = 2.5. Third, considering forecast dominance, preference is given either to the gamma network or to the power variance parameter *p* = 2.5. In general, it seems that fitting with higher power variance parameters leads to less stable results, but this statement would need further analysis. The disadvantage of this fitting approach is that we fit the models with the different power variance parameters independently to the observations, and, thus, the learned representations *z^{(d:1)}(x_i)* are rather different for different *p*'s. This makes it difficult to compare these models. This is exactly the point that we address next.

#### **Robustified Representation Learning**

To deal with this lack of comparability of the network approaches with different power variance parameters, we can try to learn a representation that simultaneously fits different models. The implementation of this idea is rather straightforward in network modeling. We choose the above network of depth *d* = 3, which gives us the (learned) representation *z_i = z^{(d:1)}(x_i)* in the last FN layer. The general idea is to design multiple outputs for this learned representation to fit the different distributional models. That is, in the case of three of Tweedie's loss functions with power variance parameters *p* ∈ {2, 2.5, 3} we consider the three-dimensional output mapping

$$\boldsymbol{x} \mapsto \left(\mu_{p=2}(\boldsymbol{x}),\, \mu_{p=2.5}(\boldsymbol{x}),\, \mu_{p=3}(\boldsymbol{x})\right)^{\top} \tag{11.6}$$

$$= \left(g^{-1}\langle\boldsymbol{\beta}_2, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle,\, g^{-1}\langle\boldsymbol{\beta}_{2.5}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle,\, g^{-1}\langle\boldsymbol{\beta}_3, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle\right)^{\top} \in \mathbb{R}^3,$$

for different output parameters *β_2, β_{2.5}, β_3* ∈ ℝ^{q_d+1}. These three expected responses (11.6) share the network parameters *w = (w^{(1)}_1, ..., w^{(d)}_{q_d})* in the FN layers, and the network fitting should learn these parameters such that *z_i = z^{(d:1)}(x_i)* gives a good representation for all considered loss functions. Choose positive weights *η_p* > 0, and define the combined deviance loss function

$$\mathfrak{D}\left(\boldsymbol{Y}, (\boldsymbol{w}, \boldsymbol{\beta}_2, \boldsymbol{\beta}_{2.5}, \boldsymbol{\beta}_3)\right) = \sum_{p \in \{2,\, 2.5,\, 3\}} \frac{\eta_p}{\varphi_p} \sum_{i=1}^n v_i\, \mathfrak{d}_p\left(Y_i, \mu_p(\boldsymbol{x}_i)\right), \tag{11.7}$$

for the given observations *(Y_i, x_i, v_i)*, 1 ≤ *i* ≤ *n*. Note that the unit deviances *d_p* live on different scales for different *p*'s. We use the (constant) weights *η_p* > 0 to balance these scales so that all power variance parameters *p* contribute roughly equally to the total loss, while setting *ϕ_p* ≡ 1 (which can be done for a constant dispersion). This approach is now fitted to the available learning data *L*. The corresponding R code is given in Listing 11.1. Note that the fitting also requires that we triplicate the observations *(Y_i, Y_i, Y_i)* so that we can simultaneously evaluate the three chosen power variance deviance losses, see lines 18–21 of Listing 11.1. We fit this model to the Swiss accident insurance data, and the results are presented in Table 11.3 on the lines called 'multi-out'.

**Listing 11.1** FN network with multiple output

```
1 Design = layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
2 #
3 Network = Design %>%
4    layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
5    layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
6    layer_dense(units=10, activation='tanh', name='FNLayer3')
7 #
8 Output1 = Network %>%
9    layer_dense(units=1, activation='exponential', name='Output1')
10 #
11 Output2 = Network %>%
12    layer_dense(units=1, activation='exponential', name='Output2')
13 #
14 Output3 = Network %>%
15    layer_dense(units=1, activation='exponential', name='Output3')
16
17 #
18 model = keras_model(inputs = c(Design), outputs = c(Output1, Output2, Output3))
19 #
20 model %>% compile(loss = list(loss1, loss2, loss3),
21    loss_weights=list(eta1, eta2, eta3), optimizer = 'nadam')
```
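The balancing weights *η_p* (eta1, eta2, eta3 in Listing 11.1) can, for instance, be chosen inversely proportional to the average unit deviance of a null model, so that each power variance parameter contributes comparably to the combined loss (11.7). A minimal Python sketch of this normalization; choosing the null model as the reference is our illustrative assumption, not a prescription of the text:

```python
import math

def unit_deviance(y, mu, p):
    """Tweedie unit deviance d_p(y, mu), cf. (11.2)-(11.3)."""
    if p == 2:
        return 2.0 * (y / mu - 1.0 - math.log(y / mu))
    if p == 3:
        return (y - mu) ** 2 / (y * mu ** 2)
    return 2.0 * (y ** (2 - p) / ((1 - p) * (2 - p))
                  - y * mu ** (1 - p) / (1 - p)
                  + mu ** (2 - p) / (2 - p))

claims = [450.0, 1200.0, 80.0, 5600.0, 230.0, 990.0]  # illustrative claim sizes
mu0 = sum(claims) / len(claims)                        # null-model mean

# average unit deviance per power variance parameter p
avg = {p: sum(unit_deviance(y, mu0, p) for y in claims) / len(claims)
       for p in (2.0, 2.5, 3.0)}

# inverse-average weights: each p then contributes ~1 per observation
eta = {p: 1.0 / avg[p] for p in avg}
for p in eta:
    assert abs(eta[p] * avg[p] - 1.0) < 1e-12  # contributions on a common scale
```

With these weights, no single power variance parameter dominates the gradient of the combined loss at the start of the training.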
This simultaneous representation learning across different loss functions leads to more stability in the results between the different loss function choices, i.e., there is less variability between the losses of the different outputs compared to fitting the three models independently. The predictive performance seems slightly better in this robustified case than in the independent one (see the bold face out-of-sample figures). The similarity of the results across the different loss functions (using the jointly learned representation *z_i*) allows us to directly compare the corresponding predictors *μ̂_p(x_i)* for the different *p*'s.

**Table 11.3** In-sample and out-of-sample losses (gamma loss, power variance case *p* = 2.5 loss (in 10^{−2}) and inverse Gaussian (IG) loss (in 10^{−3})) and average claim amounts; the losses use unit dispersion *ϕ* = 1 and the network losses are averaged deviance losses over 20 runs with different seeds

**Fig. 11.3** Ratios *μ̂_{p=2}(x_i)/μ̂_{p=2.5}(x_i)* (black color) and *μ̂_{p=3}(x_i)/μ̂_{p=2.5}(x_i)* (blue color) of the three predictors: (lhs) in-sample figures ordered on the *x*-axis w.r.t. the logged observed claims *Y_i*, darkgray and cyan lines give spline fits, (rhs) out-of-sample figures ordered on the *x*-axis w.r.t. the logged average size of the three predictors

Figure 11.3 compares the three predictors by considering the ratios *μ̂_{p=2}(x_i)/μ̂_{p=2.5}(x_i)* in black color and *μ̂_{p=3}(x_i)/μ̂_{p=2.5}(x_i)* in blue color, i.e., we divide by the (middle) predictor with power variance parameter *p* = 2.5. The figure on the left-hand side shows these ratios in-sample, ordered on the *x*-axis w.r.t. the observed claim sizes *Y_i*; the darkgray and cyan lines give spline fits to these ratios. The figure on the right-hand side shows these ratios out-of-sample, ordered on the *x*-axis w.r.t. the average predictors *μ̄_i = (μ̂_{p=2}(x_i) + μ̂_{p=2.5}(x_i) + μ̂_{p=3}(x_i))/3*. In view of (11.5) we expect the models with a smaller power variance parameter *p* to over-fit more to large claims. From Fig. 11.3 (lhs) we can observe that, indeed, this is the case (see the darkgray and cyan spline fits which bifurcate for large claims). That is, models with a smaller power variance parameter react more sensitively to large observations *Y_i*. The ratios in Fig. 11.3 show differences of up to 7% for large claims.

*Remark 11.3* The loss function (11.7) can also be interpreted as regularization. For instance, if we choose *η_2* = 1, and if we assume that this is our preferred model, then we can regularize this model with further models, whose weights *η_p* > 0 determine the degree of regularization. Thus, in contrast to the ridge and LASSO regularization of Sect. 6.2, regularization does not act directly on the model parameters here, but rather on what we learn in terms of the representation *z_i = z^{(d:1)}(x_i)*.

#### **Using Forecast Dominance to Deal with Model Uncertainty**

In GLMs, the power variance parameter *p* typically acts as a hyper-parameter, i.e., one fits different GLMs for different choices of *p*. Model selection is then done, e.g., by analyzing the Tukey–Anscombe plot, AIC, cross-validation or by studying out-of-sample forecast dominance. In networks we should not use AIC, as we neither have a parsimonious network parameter nor do we use the MLE. Here, we focus on forecast dominance for the network predictors (based on the different chosen power variance parameters). If we are mainly interested in receiving a model that provides optimal forecast dominance, we should not consider three different outputs as in (11.7), but rather fit the same output to different loss functions; the required changes are minimal, see Listing 11.2. Namely, consider one FN network with one output *μ(x_i)*, but evaluate this output simultaneously on the different chosen loss functions

$$\mathfrak{D}\left(\boldsymbol{Y}, (\boldsymbol{w}, \boldsymbol{\beta})\right) = \sum_{p\in\{2,\,2.5,\,3\}}\frac{\eta_p}{\varphi_p}\sum_{i=1}^n v_i\,\mathfrak{d}_p\left(Y_i,\mu(\boldsymbol{x}_i)\right).\tag{11.8}$$

In contrast to (11.7), we only have one FN network regression function *x_i ↦ μ(x_i)* here.

We present the results on the last line of Table 11.3, called 'multi-loss'. In our case, this approach is slightly less competitive (out-of-sample); however, it is less sensitive to outliers since we need a good regression function simultaneously for multiple loss functions. Of course, this multiple loss fitting approach is not restricted to different power variance parameters. As stated in Theorem 4.19, Bregman divergences are the only consistent loss functions for mean estimation, and the unit deviances are examples of Bregman divergences. Forecast dominance now suggests that we may choose any Bregman divergence as a loss function in Listing 11.2 as long as it reflects the expected properties of the model (and of the observed data), otherwise we will receive bad convergence properties, see also Sect. 11.1.4, below. For instance, we can robustify the Poisson claim counts model by additionally considering the deviance loss of the negative binomial model, which also assesses over-dispersion.

**Listing 11.2** FN network with a single output for multiple losses

```
1 Design = layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
2 #
3 Network = Design %>%
4    layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
5    layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
6    layer_dense(units=10, activation='tanh', name='FNLayer3')
7 #
8 Output = Network %>%
9    layer_dense(units=1, activation='exponential', name='Output')
10 #
11 model = keras_model(inputs = c(Design), outputs = c(Output, Output, Output))
12 #
13 model %>% compile(loss = list(loss1, loss2, loss3),
14    loss_weights=list(eta1, eta2, eta3), optimizer = 'nadam')
```

#### **Nagging Predictor**

The loss figures in Table 11.3 are averaged deviance losses over 20 different runs of the gradient descent algorithm with different seeds (to receive stable results). Rather than averaging the losses, we should improve the models by averaging the predictors and, then, calculating the losses of these averaged predictors; this is exactly the proposal of the nagging predictor (7.44). We calculate the nagging predictor of the models that are simultaneously fitted to the different loss functions (lines 'multi-output' and 'multi-loss' of Table 11.3). The resulting nagging predictors are reported in Table 11.4. This table shows that we should give a clear preference to the nagging predictors. The simultaneous loss fitting (11.8) gives the best out-of-sample results for the nagging predictor, see the last line of Table 11.4.
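The difference between averaging losses and the nagging predictor can be made explicit in a toy calculation (Python; the run predictions are made-up numbers). Averaging the predictors first and evaluating the loss once gives a smaller gamma deviance loss here, since the gamma unit deviance is convex in *μ* on *μ* < 2*y*:

```python
import math

def gamma_deviance(y, mu):
    return 2.0 * (y / mu - 1.0 - math.log(y / mu))

# predictions of 3 hypothetical SGD runs for 4 claims (illustrative numbers)
preds = [
    [900.0, 410.0, 1500.0, 260.0],
    [960.0, 380.0, 1350.0, 300.0],
    [880.0, 440.0, 1620.0, 275.0],
]
obs = [1000.0, 400.0, 1400.0, 280.0]

# averaged loss over runs (what Table 11.3 reports)
avg_loss = sum(
    sum(gamma_deviance(y, m) for y, m in zip(obs, run)) / len(obs)
    for run in preds) / len(preds)

# nagging predictor (7.44): average the predictors first, then evaluate once
nagging = [sum(run[i] for run in preds) / len(preds) for i in range(len(obs))]
nag_loss = sum(gamma_deviance(y, m) for y, m in zip(obs, nagging)) / len(obs)

# Jensen's inequality (all predictions lie in the convex region mu < 2y)
assert nag_loss < avg_loss
```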

Figure 11.4 shows the Tukey–Anscombe plot of the multi-loss nagging predictor for the different deviance losses (for unit dispersion). Again, the case *p* = 2*.*5 is closest to having a constant dispersion, and the other cases will require dispersion modeling *ϕ(x)*.

Figure 11.5 shows the empirical auto-calibration property of the multi-loss nagging predictor. This auto-calibration property is calculated as in Listing 7.8. We observe that the auto-calibration property holds rather accurately. Only for claim predictors *μ̂(x_i)* above 10'000 CHF (vertical dotted line in Fig. 11.5) do the fitted means underestimate the observed average claim sizes. This affects (only) 1.7% of all claims, and it could be corrected as described in Example 7.19.
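Schematically, such an empirical auto-calibration check groups the claims by the size of the predictor *μ̂(x_i)* and compares, per group, the average prediction with the average observation; a toy Python sketch with made-up figures (the bucketing rule and the 10% tolerance are our illustrative choices):

```python
# auto-calibration: within buckets of the predictor mu(x), the average
# observation should match the average prediction, E[Y | mu(X)] = mu(X)
pairs = [  # (predicted mean, observed claim) -- illustrative figures
    (300.0, 280.0), (320.0, 350.0), (310.0, 305.0),
    (2500.0, 2400.0), (2600.0, 2700.0), (2550.0, 2500.0),
]

buckets = {}
for mu, y in pairs:
    key = 0 if mu < 1000.0 else 1          # two crude size buckets
    buckets.setdefault(key, []).append((mu, y))

for key, grp in sorted(buckets.items()):
    avg_mu = sum(mu for mu, _ in grp) / len(grp)
    avg_y = sum(y for _, y in grp) / len(grp)
    ratio = avg_y / avg_mu
    assert 0.9 < ratio < 1.1   # calibrated within 10% in this toy example
```

In practice one uses many quantile-based buckets of *μ̂(x_i)*, and a systematic deviation of the ratio from 1 in the upper buckets is exactly the underestimation visible in Fig. 11.5.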

**Table 11.4** In-sample and out-of-sample losses (gamma loss, power variance case *p* = 2.5 loss (in 10^{−2}) and inverse Gaussian (IG) loss (in 10^{−3})) and average claim amounts; the losses use unit dispersion *ϕ* = 1


**Fig. 11.4** Tukey–Anscombe plots giving the deviance residuals of the multi-loss nagging predictor of Table 11.4 for different power variance parameters: (lhs) gamma deviances *p* = 2, (middle) power variance deviances *p* = 2.5, (rhs) inverse Gaussian deviances *p* = 3; the cyan lines show twice the estimated standard deviation of the deviance residuals as a function of the logged estimated means *μ̂*

## *11.1.3 Lab: Deep Dispersion Modeling*

From the Tukey–Anscombe plots in Fig. 11.4 we conclude that the dispersion requires regression modeling, too, as the dispersion does not seem to be constant over the whole range of the expected claim sizes. We therefore explore a *double FN network model*, which is similar in spirit to the double GLM of Sect. 5.5. We assume to work within Tweedie's family with power variance parameters *p* ≥ 2, and with unit deviances given by (11.2)–(11.3). The saddlepoint approximation (5.59) gives us

$$f(\mathbf{y}; \theta, v/\varphi) \approx \left(\frac{2\pi\varphi}{v}V(\mathbf{y})\right)^{-1/2} \exp\left\{-\frac{1}{2\varphi/v}\mathfrak{d}\_p(\mathbf{y}, \mu)\right\},$$

with power variance function *V(y) = y^p*. This saddlepoint approximation is formulated in the reproductive form for *Y = X/ω = Xϕ/v*. This requires scaling the observations *X* with the unknown *ϕ* to receive *Y*. In Sect. 5.5.4 we have shown how this problem can be solved. In this section we give a different proposal which is more robust in network fitting, and which benefits from the *b*-homogeneity of *d_p*, see (11.4).

We consider the variable transformation *y* → *x* = *yω* = *yv/ϕ*. In the absolutely continuous case *p* ≥ 2 this gives us the approximation

$$f(x;\theta,v/\varphi) \approx \left(\frac{2\pi\varphi^{1+p}}{v^{1+p}}V(x)\right)^{-1/2} \exp\left\{-\frac{1}{2\varphi/v}\,\mathfrak{d}_p\left(\frac{x\varphi}{v},\frac{\mu_p\varphi}{v}\right)\right\}\frac{\varphi}{v}$$

$$= \left(\frac{2\pi\varphi^{p-1}}{v^{p-1}}V(x)\right)^{-1/2}\exp\left\{-\frac{1}{2\varphi^{p-1}/v^{p-1}}\,\mathfrak{d}_p\left(x,\mu_p\right)\right\},$$

with mean *μ_p = μv/ϕ* of *X = Yv/ϕ*, where the second line uses the *b*-homogeneity (11.4) of the unit deviance. We set *φ = −1/ϕ^{p−1}* < 0. This gives us the approximation

$$\ell\_X(\mu\_p, \phi) \approx \frac{v^{p-1} \mathfrak{d}\_p(X, \mu\_p) \phi - (-\log \left( -\phi \right))}{2} - \frac{1}{2} \log \left( \frac{2\pi}{v^{p-1}} V(X) \right). \tag{11.9}$$

For given mean *μp* we again have a gamma approximation on the right-hand side, but we scale the dispersion differently. This gives us the approximate first moment

$$\mathbb{E}\_{\phi} \left[ \left. v^{p-1} \mathfrak{d}\_p(X, \mu\_p) \right| \mu\_p \right] \approx \kappa\_2'(\phi) = -1/\phi = \varphi^{p-1} \stackrel{\text{def.}}{=} \varphi\_p.$$

The remainder of this modeling is similar to the residual MLE approach in Sect. 5.5.3. Namely, we set up two FN network regression functions

$$\mathbf{x} \mapsto \mu\_p(\mathbf{x}) \qquad \text{and} \qquad \mathbf{x} \mapsto \varphi\_p(\mathbf{x}) = \kappa\_2'(\phi(\mathbf{x})) = -1/\phi(\mathbf{x}).$$

Parameter fitting is achieved by alternating the network parameter fitting of *μ_p(x)* and *ϕ_p(x)*, see also Sect. 5.5.4. We start the iteration by setting the dispersion to a constant *ϕ̂_p^{(0)}(x)* ≡ const. In this case, the dispersion cancels in the score equations, and the mean *μ̂_p^{(1)}(x)* can be estimated without explicit knowledge of the (constant) dispersion parameter *ϕ̂_p^{(0)}*; this exactly provides the results of the previous Sect. 11.1.2. Then, we iterate this procedure for *t* ≥ 1. For a given mean estimate *μ̂_p^{(t)}(x)* we receive the deviances *v^{p−1} d_p(X, μ̂_p^{(t)}(x))*, and this allows us to estimate *ϕ̂_p^{(t)}(x)* from the approximate gamma model (11.9); and for given dispersion parameters *ϕ̂_p^{(t)}(x)* we estimate *μ̂_p^{(t+1)}(x)* from the corresponding Tweedie's model for the observation *X*.
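The alternating scheme can be illustrated in a toy setting with two homogeneous groups, where both steps have closed forms: the mean step reduces to the per-group sample mean, and the dispersion step fits a gamma model to the observed deviances, whose mean estimate is the per-group average deviance, cf. (11.9). A Python sketch (illustrative only, with made-up claim sizes and *v_i* ≡ 1):

```python
import math

def d2(y, mu):  # gamma unit deviance (p = 2)
    return 2.0 * (y / mu - 1.0 - math.log(y / mu))

groups = {  # two homogeneous risk groups, illustrative claim sizes
    "small": [120.0, 90.0, 150.0, 110.0],
    "large": [3000.0, 5200.0, 2400.0, 4100.0],
}

phi = {g: 1.0 for g in groups}            # t = 0: constant dispersion
for t in range(2):                        # alternate mean <-> dispersion steps
    # mean step: with group-wise constant dispersion the weighted score
    # equations reduce to the per-group sample mean
    mu = {g: sum(ys) / len(ys) for g, ys in groups.items()}
    # dispersion step: fit a gamma model to the observed deviances; its
    # mean estimate is the per-group average deviance, cf. (11.9)
    phi = {g: sum(d2(y, mu[g]) for y in ys) / len(ys)
           for g, ys in groups.items()}

assert phi["small"] != phi["large"]       # heterogeneous dispersion detected
assert all(v > 0.0 for v in phi.values())
```

In this homogeneous-group toy case the mean step does not depend on the dispersion, so the iteration stabilizes after one pass; with networks, the mean and dispersion steps genuinely interact through the shared features.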

*Example 11.4* We revisit the Swiss accident insurance data example of Sect. 11.1.2, and we use the robustified representation learning approach (11.7) that simultaneously fits Tweedie's models with the power variance parameters *p* = 2, 2.5, 3. The initial calibration step is done for constant dispersions *ϕ̂_p^{(0)}(x)* ≡ const, and it provides us with the estimated means *μ̂_p^{(1)}(x)* as illustrated in Fig. 11.3. For stability reasons we choose the nagging predictor averaging over 20 SGD runs with 20 different seeds. These estimated means *μ̂_p^{(1)}(x)* give us the deviances *v^{p−1} d_p(X, μ̂_p^{(1)}(x))*.

Using these deviances allows us to alternate the dispersion and mean estimation for *t* ≥ 1. For given means *μ̂_p^{(t)}(x)*, *p* = 2, 2.5, 3, we set up a deep FN network *x ↦ z^{(d:1)}(x)* that allows for a robustified deep dispersion learning of *ϕ_p(x)*, for *p* = 2, 2.5, 3. Under the log-link choice we consider the regression function with multiple outputs

$$\boldsymbol{x} \mapsto \left(\varphi_{p=2}(\boldsymbol{x}),\, \varphi_{p=2.5}(\boldsymbol{x}),\, \varphi_{p=3}(\boldsymbol{x})\right)^{\top} \tag{11.10}$$

$$= \left(\exp\langle\boldsymbol{\alpha}_2, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle,\, \exp\langle\boldsymbol{\alpha}_{2.5}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle,\, \exp\langle\boldsymbol{\alpha}_3, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle\right)^{\top} \in \mathbb{R}_+^3,$$

for different output parameters *α_2, α_{2.5}, α_3* ∈ ℝ^{q_d+1}. These three dispersion responses (11.10) share the common network parameter *w̃ = (w̃^{(1)}_1, ..., w̃^{(d)}_{q_d})* in the FN layers of *z^{(d:1)}*. The network fitting learns these parameters simultaneously for the different power variance parameters. Choose positive weights *η̃_p* > 0, and define the combined deviance loss function (based on the gamma model *κ_2* and having dispersion parameter 2)

$$\mathfrak{D}\left(\boldsymbol{d}(\boldsymbol{X},\widehat{\boldsymbol{\mu}}^{(t)}),(\widetilde{\boldsymbol{w}},\boldsymbol{\alpha}_2,\boldsymbol{\alpha}_{2.5},\boldsymbol{\alpha}_3)\right) = \sum_{p\in\{2,\,2.5,\,3\}}\frac{\widetilde{\eta}_p}{2}\sum_{i=1}^n \mathfrak{d}_2\left(v_i^{p-1}\mathfrak{d}_p(X_i,\widehat{\mu}_p^{(t)}(\boldsymbol{x}_i)),\,\varphi_p(\boldsymbol{x}_i)\right),\tag{11.11}$$

where *X = (X_1, ..., X_n)* collects the unscaled observations *X_i = Y_i v_i/ϕ_i*. Thus, for all power variance parameters *p* = 2, 2.5, 3 we fit a gamma model *d_2(·,·)/2* to the observed deviances (observations) *v_i^{p−1} d_p(X_i, μ̂_p^{(t)}(x_i))*, providing us with the estimated dispersions *ϕ̂_p^{(t)}(x_i)*. This fitting step is performed with the R code of Listing 11.1, where the losses on line 20 are all given by the gamma deviance losses of (11.11), and the deviances *v_i^{p−1} d_p(X_i, μ̂_p^{(t)}(x_i))* play the role of the responses (observations).

In the next step we update the mean estimates *μ̂_p^{(t+1)}(x_i)*, given the estimated dispersions *ϕ̂_p^{(t)}(x_i)* from the previous step. This requires that we optimize the expected responses (11.6) for given heterogeneous dispersion parameters. We therefore consider the loss function, for positive weights *η_p* > 0, see (11.7),

$$\mathfrak{D}\left(\boldsymbol{X},\widehat{\boldsymbol{\varphi}}^{(t)},(\boldsymbol{w},\boldsymbol{\beta}_2,\boldsymbol{\beta}_{2.5},\boldsymbol{\beta}_3)\right) = \sum_{p \in \{2,\,2.5,\,3\}} \eta_p \sum_{i=1}^n \frac{v_i^{p-1}}{\widehat{\varphi}_p^{(t)}(\boldsymbol{x}_i)}\, \mathfrak{d}_p\left(X_i,\mu_p(\boldsymbol{x}_i)\right). \tag{11.12}$$

We fit this model by iterating this approach for *t* ≥ 1: we start from the predictors of Sect. 11.1.2, which provide us with the first mean estimates *μ̂_p^{(1)}(x_i)*. Based on these mean estimates we iterate this robustified estimation of *ϕ̂_p^{(t)}(x_i)* and *μ̂_p^{(t)}(x_i)*. We give some remarks:


We iterate this algorithm over two loops, and the results are presented in Table 11.5. We observe a decrease of *−2ℓ_X(μ̂_p^{(t)}, ϕ̂_p^{(t)})* when iterating the fitting algorithm for *t* ≥ 1. For AIC, we would have to correct twice the negative log-likelihood by twice the number of MLE estimated parameters. We also adjust here correspondingly, though the correction is not justified by any theory, because we neither work with the MLE nor do we have a parsimonious model for mean and dispersion estimation. Nevertheless, we receive smaller values than in Table 11.1, which supports the use of this more complex double FN network model.

**Table 11.5** Iteration of the mean *μ̂_p^{(t)}* and dispersion *ϕ̂_p^{(t)}* estimation for the gamma model *p* = 2, the power variance parameter *p* = 2.5 model and the inverse Gaussian model *p* = 3: the numbers correspond to *−2ℓ_X(μ̂_p^{(t)}, ϕ̂_p^{(t)})*; the last line corrects *−2ℓ_X(μ̂_p^{(t)}, ϕ̂_p^{(t)})* by 2 · 2 · 812 = 3'248 (twice the number of parameters used in the mean and dispersion FN networks)

Comparing the three power variance parameter models, we now give preference to the inverse Gaussian model, as it has the largest log-likelihood. Note that we can directly compare all power variance models, as the complexity is equal in all of them (they only differ in the chosen power variance parameter) and the joint robustified fitting applies the same stopping rule to all power variance parameter models. The same result is obtained by comparing the out-of-sample log-likelihoods. Note that we do not compare the deviance losses here, because the unit deviances are not designed to estimate parameters in vector-valued parameter families; we model the dispersion as a second parameter.

Next, we study the estimated dispersions *ϕ̂_p(x_i)* as a function of the estimated means *μ̂_p(x_i)*. We fit a spline to *ϕ̂_p(x_i)* as a function of *μ̂_p(x_i)*, and we receive estimates that almost perfectly match the cyan lines in Fig. 11.4. This provides a proof of concept that the dispersion regression model finds the right level of dispersion as a function of the expected means.

Using the mean and dispersion estimates, we can calculate the dispersion-scaled deviance residuals

$$r_i^{\mathcal{D}} = \operatorname{sign}(X_i - \widehat{\mu}_p(\boldsymbol{x}_i)) \sqrt{v_i^{p-1}\, \mathfrak{d}_p\left(X_i, \widehat{\mu}_p(\boldsymbol{x}_i)\right) / \widehat{\varphi}_p(\boldsymbol{x}_i)}. \tag{11.13}$$

This then allows us to give the Tukey–Anscombe plots for the three considered power variance parameters.

The corresponding plots are given in Fig. 11.6; the difference to Fig. 11.4 is that the latter considers unit dispersion, whereas the former scales the residuals with the rooted dispersion *ϕ̂_p(x_i)^{1/2}*; note that *v_i* ≡ 1 in this example. By scaling with the rooted dispersion, the resulting deviance residuals *r_i^D* should roughly have unit standard deviation. From Fig. 11.6 we observe that this is indeed the case: the cyan line shows a spline fit of twice the standard deviation of the deviance residuals *r_i^D*. These splines are of magnitude 2, which verifies the unit standard deviation property. Moreover, the cyan lines are roughly horizontal, which indicates that the dispersion estimation and the scaling work across all expected claim sizes *μ̂_p(x_i)*. The three different power variance parameters *p* = 2, 2.5, 3 show different behaviors in the lower and upper tails of the residuals (centering around the orange horizontal zero line in Fig. 11.6), which corresponds to the different distributional properties of the chosen models.

**Fig. 11.6** Tukey–Anscombe plots giving the dispersion-scaled deviance residuals *r_i^D* (11.13) of the models jointly fitting the mean parameters *μ̂_p(x_i)* and the dispersion parameters *ϕ̂_p(x_i)*: (lhs) gamma model, (middle) power variance parameter *p* = 2.5 model, and (rhs) inverse Gaussian model; the cyan lines correspond to 2 standard deviations

**Fig. 11.7** (lhs) Gamma model: observations vs. simulations on log-scale, (middle) gamma model: estimated shape parameters *α̂_t^† = 1/ϕ̂_2(x_t^†)* < 1, 1 ≤ *t* ≤ *T*, and (rhs) inverse Gaussian model: observations vs. simulations on log-scale
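The unit standard deviation property of the scaled residuals (11.13) can be checked by simulation: for a gamma sample with known dispersion *ϕ*, the residuals *r_i^D* should have standard deviation close to 1 (Python, seeded simulation; the chosen parameter values are illustrative):

```python
import math
import random

random.seed(1)
mu, phi = 1.0, 0.25                       # gamma mean and dispersion
alpha = 1.0 / phi                         # corresponding gamma shape parameter

def d2(y, m):                             # gamma unit deviance
    return 2.0 * (y / m - 1.0 - math.log(y / m))

# simulate gamma observations with mean mu and dispersion phi
xs = [random.gammavariate(alpha, mu / alpha) for _ in range(20000)]

# dispersion-scaled deviance residuals, cf. (11.13) with v_i = 1
res = [math.copysign(math.sqrt(d2(x, mu) / phi), x - mu) for x in xs]

m = sum(res) / len(res)
sd = math.sqrt(sum((r - m) ** 2 for r in res) / len(res))
assert 0.85 < sd < 1.15                   # roughly unit standard deviation
```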

We further analyze the gamma and the inverse Gaussian models. Note that the analysis of the power variance models for general power variance parameters *p* ∉ {0, 1, 2, 3} is more difficult because neither the EDF density nor the EDF distribution function has a closed form. To analyze the gamma and the inverse Gaussian models we simulate observations *X_t^sim*, *t* = 1, ..., *T*, from the estimated models (using the out-of-sample features *x_t^†* of the test data *T*), and we compare them against the true out-of-sample observations *X_t^†*. Figure 11.7 shows the results for the gamma model (lhs) and the inverse Gaussian model (rhs) on the log-scale. A good fit has been achieved if the black dots lie on the red diagonal line (in the colored version), because then the simulated data shares similar features with the observed data. The fit of the inverse Gaussian model seems reasonably good.

On the other hand, we see that the gamma model gives a poor fit, especially in the lower tail. This supports the AIC values of Table 11.5. The problem with the gamma model is that the data is more heavy-tailed than the gamma model can accommodate. As a consequence, the dispersion parameter estimates *ϕ̂_2(x_t^†)* in the gamma model compensate for this by taking values bigger than 1. A dispersion parameter bigger than 1 implies a shape parameter in the gamma model of *α̂_t^† = 1/ϕ̂_2(x_t^†)* < 1, and the resulting gamma density is strictly decreasing, see Fig. 2.1. If we simulate from this model, we receive many observations *X_t^sim* close to zero (from the strictly decreasing density). This can be seen in the lower-left part of the graph in Fig. 11.7 (lhs), suggesting that we should have many observations with *X_t^†* ∈ (0, 1), or on the log-scale log(*X_t^†*) < 0. However, the graph shows that this is not the case in the real data. Figure 11.7 (middle) shows the boxplot of the estimated shape parameters *α̂_t^†* on the test data, 1 ≤ *t* ≤ *T*, verifying that most insurance policies of the test data *T* receive a shape parameter *α̂_t^†* less than 1.
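The claim that a shape parameter *α* < 1 piles up probability mass near zero (strictly decreasing density) is easily verified by simulation, comparing two gamma laws with the same mean but different shapes (Python, seeded; the parameter values are illustrative):

```python
import random

random.seed(0)
n = 20000

# two gamma distributions with the same mean 1 but different shape parameters
small_shape = [random.gammavariate(0.5, 2.0) for _ in range(n)]   # alpha < 1
large_shape = [random.gammavariate(4.0, 0.25) for _ in range(n)]  # alpha > 1

frac_small = sum(x < 0.1 for x in small_shape) / n
frac_large = sum(x < 0.1 for x in large_shape) / n

# the alpha < 1 law places far more mass near zero than the alpha > 1 law
assert frac_small > 5 * frac_large
```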

We conclude that the inverse Gaussian double FN network model seems to work well for this data, and we give preference to this model.

## *11.1.4 Pseudo Maximum Likelihood Estimator*

This short section gives a mathematical foundation to parameter estimation under model uncertainty and model misspecification. We summarize the results of Gourieroux et al. [168], and we refrain from giving any proofs in this section. Assume that the real-valued observations *Yi*, 1 ≤ *i* ≤ *n*, have been generated by the model

$$Y_i = \mu_{\zeta_0}(\boldsymbol{x}_i) + \varepsilon_i,\tag{11.14}$$

with (true) parameter *ζ_0* ∈ Λ ⊂ ℝ^r, features *x_i* ∈ *X* ⊆ {1} × ℝ^q, and where the conditional distribution of the noise random variables *(ε_i)_{1≤i≤n}* satisfies the conditional independence property *p_ε(ε_1, ..., ε_n | x_1, ..., x_n)* = ∏_{i=1}^n *p_ε(ε_i | x_i)*. Denote by *p_x(x)* the portfolio distribution of the features *x*. Thus, under (11.14), the claim *Y* of a randomly selected policy is generated by the joint probability measure *p_{ε,x}(ε, x) = p_ε(ε|x) p_x(x)*. The technical assumptions under which the following statements hold are given in Assumption 11.9 at the end of this section.

Let *F*0*(*·|*xi)* denote the true conditional distribution of *Yi*, given *xi*. Typically, this (true) conditional distribution is unknown. It is assumed to provide the first two conditional moments

$$\mathbb{E}\_{\zeta\_0}\left[\left. Y\_i \,\right|\, \boldsymbol{x}\_i \right] = \mu\_{\zeta\_0}(\boldsymbol{x}\_i) \qquad \text{and} \qquad \mathrm{Var}\_{\zeta\_0}\left(\left. Y\_i \,\right|\, \boldsymbol{x}\_i \right) = \sigma\_0^2(\boldsymbol{x}\_i).$$

Thus, $\varepsilon_i|\boldsymbol{x}_i$ is assumed to be centered with conditional variance $\sigma_0^2(\boldsymbol{x}_i)$, see (11.14). Our goal is to estimate the (true) parameter $\zeta_0 \in \Lambda$, despite the fact that the conditional distribution $F_0(\cdot|\boldsymbol{x})$ of the observations is unknown. Throughout we assume parameter identifiability, i.e., if $\mu_{\zeta_1}(\boldsymbol{x}) = \mu_{\zeta_2}(\boldsymbol{x})$, $p_{\boldsymbol{x}}$-a.s., then $\zeta_1 = \zeta_2$. The following estimator is called *pseudo maximum likelihood estimator* (PMLE)

$$\widehat{\zeta}\_n^{\mathrm{PMLE}} = \underset{\zeta \in \Lambda}{\arg\min}\ \frac{1}{n} \sum\_{i=1}^n \mathfrak{d}(Y\_i, \mu\_{\zeta}(\boldsymbol{x}\_i)),\tag{11.15}$$

where $\mathfrak{d}(y, \mu)$ is the unit deviance of a (pre-chosen) single-parameter linear EDF being parametrized by the same parameter space $\Lambda \subset \mathbb{R}^r$ as the original model (11.14); note that $\Lambda$ is not the effective domain of the chosen EDF. $\widehat{\zeta}_n^{\mathrm{PMLE}}$ is called PMLE because it is a MLE for $\zeta_0 \in \Lambda$, but not in the right model, because the pre-chosen EDF in (11.15) typically differs from the (unknown) true conditional distribution $F_0(\cdot|\boldsymbol{x})$. Nevertheless, we may hope to find the true parameter $\zeta_0$, but possibly at a slower asymptotic rate. This is exactly what is going to be stated in the next theorems.
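To make the PMLE idea concrete, the following NumPy sketch (in Python rather than the book's R, with all names our own) generates Poisson data and estimates the regression parameter by minimizing a Gaussian unit deviance, i.e., with a deliberately mis-specified EDF. The minimizer still lands close to the true parameter, illustrating the strong consistency in Theorem 11.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
zeta0 = 0.5                         # true parameter zeta_0
x = rng.uniform(0.0, 1.0, n)        # features
Y = rng.poisson(np.exp(zeta0 * x))  # true data model: Poisson, not Gaussian

# Gaussian PMLE: minimize the Gaussian unit deviance (squared error),
# i.e., we estimate with a pre-chosen EDF that differs from the data model.
grid = np.linspace(0.0, 1.0, 401)
losses = [np.mean((Y - np.exp(z * x)) ** 2) for z in grid]
zeta_hat = grid[int(np.argmin(losses))]
print(zeta_hat)  # close to the true value 0.5
```

The grid search stands in for a generic deviance minimizer; any unit deviance of the EDF would do, only the asymptotic variance of the estimator changes.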

**Theorem 11.5 (Theorem 1 of Gourieroux et al. [168])** *Denote by $\mathcal{M} = \kappa'(\mathring{\Theta})$ the dual mean parameter space of the pre-chosen EDF (having cumulant function $\kappa$), and assume that $\mu_\zeta(\boldsymbol{x}) \in \mathcal{M}$ for all $\boldsymbol{x} \in \mathcal{X}$ and $\zeta \in \Lambda$. Let Assumption 11.9, below, hold. The PMLE $\widehat{\zeta}_n^{\mathrm{PMLE}}$ is strongly consistent for $\zeta_0$, i.e., it converges a.s. to $\zeta_0$ as $n \to \infty$.*

This theorem tells us that we can perform MLE in a pre-chosen EDF (which may differ from the true data model), and asymptotically we find the true parameter $\zeta_0$ of the data model $F_0(\cdot|\boldsymbol{x})$. Of course, this uses the fact that any unit deviance $\mathfrak{d}$ is a strictly consistent loss function for mean estimation, see Theorem 4.19. We not only receive consistency, but the following theorem also gives us the rate of convergence.

**Theorem 11.6 (Theorem 3 of Gourieroux et al. [168])** *Make the same assumptions as in Theorem 11.5. The PMLE $\widehat{\zeta}_n^{\mathrm{PMLE}}$ has the following asymptotic behavior*

$$
\sqrt{n}\left(\widehat{\zeta}\_n^{\mathrm{PMLE}} - \zeta\_0\right) \Rightarrow \mathcal{N}\left(0,\ \mathcal{I}^\*(\zeta\_0)^{-1}\, \Sigma(\zeta\_0)\, \mathcal{I}^\*(\zeta\_0)^{-1}\right) \qquad \text{for } n \to \infty,
$$

*with the following matrices evaluated in ζ* = *ζ*<sup>0</sup>

$$\begin{aligned} \mathcal{I}^\*(\zeta) &= \mathbb{E}\_{\boldsymbol{x}} \left[ \mathcal{I}^\*(\zeta; \boldsymbol{x}) \right] = \mathbb{E}\_{\boldsymbol{x}} \left[ J(\zeta; \boldsymbol{x})^\top\, \kappa''(h(\mu\_{\zeta}(\boldsymbol{x})))\, J(\zeta; \boldsymbol{x}) \right] \in \mathbb{R}^{r \times r}, \\ \Sigma(\zeta) &= \mathbb{E}\_{\boldsymbol{x}} \left[ J(\zeta; \boldsymbol{x})^\top\, \sigma\_0^2(\boldsymbol{x})\, J(\zeta; \boldsymbol{x}) \right] \in \mathbb{R}^{r \times r}, \end{aligned}$$

*where $h = (\kappa')^{-1}$ is the canonical link of the pre-chosen EDF, and with the change of variable $\zeta \mapsto \theta = \theta(\zeta) = h(\mu_\zeta(\boldsymbol{x})) \in \mathring{\Theta}$, for given feature $\boldsymbol{x}$, having Jacobian*

$$J(\zeta; \boldsymbol{x}) = \left(\frac{\partial}{\partial \zeta\_k}\, h(\mu\_{\zeta}(\boldsymbol{x}))\right)\_{1 \le k \le r} = \frac{1}{\kappa''(h(\mu\_{\zeta}(\boldsymbol{x})))} \left(\nabla\_{\zeta}\, \mu\_{\zeta}(\boldsymbol{x})\right)^{\top} \in \mathbb{R}^{1 \times r}.$$

Remark that $\mathcal{I}^*(\zeta)$ averages Fisher's information $\mathcal{I}^*(\zeta; \boldsymbol{x})$ (of the chosen EDF) over the feature distribution $p_{\boldsymbol{x}}$. This theorem can be seen as a modification of (3.36) to the regression case. Theorem 11.6 gives us the asymptotic normality of the PMLE, and the resulting asymptotic variance depends on how well the pre-chosen EDF matches the true data distribution $F_0(\cdot|\boldsymbol{x})$. The following lemma corresponds to Property 5 in Gourieroux et al. [168].

**Lemma 11.7** *The asymptotic variance in Theorem 11.6 has the following lower bound, where we set $\zeta = \zeta_0$ and $\sigma^2(\boldsymbol{x}) = \sigma_0^2(\boldsymbol{x})$,*

$$\mathcal{I}^\*(\zeta)^{-1}\, \Sigma(\zeta)\, \mathcal{I}^\*(\zeta)^{-1} \ \ge\ \mathcal{H}(\zeta) = \mathbb{E}\_{\boldsymbol{x}}\left[\nabla\_{\zeta}\, \mu\_{\zeta}(\boldsymbol{x})\, \sigma^{-2}(\boldsymbol{x}) \left(\nabla\_{\zeta}\, \mu\_{\zeta}(\boldsymbol{x})\right)^{\top}\right]^{-1} \in \mathbb{R}^{r \times r}.$$

*Proof* We set $\tau^2(\boldsymbol{x}) = \kappa''(h(\mu_\zeta(\boldsymbol{x})))$. We have $J(\zeta; \boldsymbol{x})^\top = \nabla_\zeta\, \mu_\zeta(\boldsymbol{x})\, \tau^{-2}(\boldsymbol{x})$. The following matrix is positive semi-definite and it satisfies

$$\begin{aligned} &\mathbb{E}\_{\boldsymbol{x}} \Big[ \Big( \mathcal{I}^\*(\zeta)^{-1} J(\zeta; \boldsymbol{x})^{\top} - \mathcal{H}(\zeta) J(\zeta; \boldsymbol{x})^{\top} \tau^{2}(\boldsymbol{x})\, \sigma^{-2}(\boldsymbol{x}) \Big)\, \sigma^{2}(\boldsymbol{x}) \\ &\qquad\qquad \times \Big( \mathcal{I}^\*(\zeta)^{-1} J(\zeta; \boldsymbol{x})^{\top} - \mathcal{H}(\zeta) J(\zeta; \boldsymbol{x})^{\top} \tau^{2}(\boldsymbol{x})\, \sigma^{-2}(\boldsymbol{x}) \Big)^{\top} \Big] \\ &= \mathcal{I}^\*(\zeta)^{-1} \Sigma(\zeta)\, \mathcal{I}^\*(\zeta)^{-1} - \mathcal{H}(\zeta)\, \mathcal{I}^\*(\zeta)\, \mathcal{I}^\*(\zeta)^{-1} - \mathcal{I}^\*(\zeta)^{-1}\, \mathcal{I}^\*(\zeta)\, \mathcal{H}(\zeta) + \mathcal{H}(\zeta)\, \mathcal{H}(\zeta)^{-1}\, \mathcal{H}(\zeta) \\ &= \mathcal{I}^\*(\zeta)^{-1} \Sigma(\zeta)\, \mathcal{I}^\*(\zeta)^{-1} - \mathcal{H}(\zeta). \end{aligned}$$

This proves the claim. $\square$

Theorem 11.6 and Lemma 11.7 tell us that if we estimate the parameter $\zeta_0$ of the unknown model $F_0(\cdot|\boldsymbol{x})$ with PMLE based on a single-parameter linear EDF, we receive minimal asymptotic variance if we can match the variance $V(\mu_{\zeta_0}(\boldsymbol{x})) = \kappa''(h(\mu_{\zeta_0}(\boldsymbol{x})))$ of the chosen EDF with the variance $\sigma_0^2(\boldsymbol{x})$ of the true data model. E.g., if we know that the variance in the true model behaves as $\sigma_0^2(\boldsymbol{x}) = \mu_{\zeta_0}^3(\boldsymbol{x})$, we should select the inverse Gaussian model with variance function $V(\mu) = \mu^3$ for PMLE.

If the members of the single-parameter linear EDF do not fully match the variance structure of the true data, we can turn our attention to a dispersion submodel as in Sect. 5.5.1. Assume for the variance structure of the true data

$$\mathrm{Var}\_{\zeta\_0}(Y\_i|\boldsymbol{x}\_i) = \sigma\_0^2(\boldsymbol{x}\_i) = \frac{1}{v\_i}\, s\_{\alpha\_0}^2(\boldsymbol{x}\_i),$$


for a regression function $\boldsymbol{x} \mapsto s_{\alpha_0}^2(\boldsymbol{x})$ involving the (true) regression parameter $\alpha_0$ and exposures $v_i > 0$. If we choose a fixed EDF, we have the log-likelihood function

$$(\mu, \varphi)\ \mapsto\ \ell\_Y(\mu, \varphi; v) = \frac{v}{\varphi} \left[ Y h(\mu) - \kappa(h(\mu)) \right] + a(Y; v/\varphi).$$

Equating the variance structure of the true data model with the variance in this pre-specified EDF, we obtain the feature-dependent dispersion parameter

$$\varphi(\boldsymbol{x}\_i) = \frac{s\_{\alpha\_0}^2(\boldsymbol{x}\_i)}{V(\mu\_{\zeta\_0}(\boldsymbol{x}\_i))},\tag{11.16}$$

with variance function $V(\mu) = (\kappa'' \circ h)(\mu)$. The following theorem proposes a two-step procedure for this estimation problem.

**Theorem 11.8 (Theorem 4 of Gourieroux et al. [168])** *Assume $\widetilde{\zeta}_n$ and $\widetilde{\alpha}_n$ are strongly consistent estimators for $\zeta_0$ and $\alpha_0$, as $n \to \infty$, such that $\sqrt{n}(\widetilde{\zeta}_n - \zeta_0)$ and $\sqrt{n}(\widetilde{\alpha}_n - \alpha_0)$ are bounded in probability. The quasi-generalized pseudo maximum likelihood estimator (QPMLE) of $\zeta_0$ is obtained by*

$$\widehat{\zeta}\_n^{\text{QPMLE}} = \underset{\zeta \in \Lambda}{\text{arg}\max} \sum\_{i=1}^n \ell\_{Y\_i} \left( \mu\_{\zeta}(\mathbf{x}\_i), \frac{s\_{\widetilde{\alpha}\_n}^2(\mathbf{x}\_i)}{V(\mu\_{\widetilde{\zeta}\_n}(\mathbf{x}\_i))}; v\_i \right).$$

*Under Assumption 11.9, below, $\widehat{\zeta}_n^{\mathrm{QPMLE}}$ is strongly consistent and best asymptotically normal, i.e.,*

$$\sqrt{n}\left(\widehat{\zeta}\_{n}^{\mathrm{QPMLE}} - \zeta\_{0}\right) \Rightarrow \mathcal{N}\left(0,\ \mathcal{H}(\zeta\_{0})\right) \qquad \text{for } n \to \infty.$$

This justifies the approach(es) in the previous chapters and sections, though not fully, because we neither work with the MLE in FN networks nor do we care about parameter identifiability. Nevertheless, this short section suggests finding strongly consistent estimators $\widetilde{\zeta}_n$ and $\widetilde{\alpha}_n$ for $\zeta_0$ and $\alpha_0$. This gives us a first model calibration step that allows us to specify the dispersion structure $\boldsymbol{x} \mapsto \varphi(\boldsymbol{x})$ via (11.16). Using this dispersion structure and the deviance loss function (4.9) for a variable dispersion parameter $\varphi(\boldsymbol{x})$, the QPMLE is then obtained in a second step, where we replace the likelihood maximization by the deviance loss minimization,

$$\widehat{\zeta}\_n^{\mathrm{QPMLE}} = \underset{\zeta \in \Lambda}{\arg\min}\ \frac{1}{n} \sum\_{i=1}^n \frac{v\_i}{s\_{\widetilde{\alpha}\_n}^2(\boldsymbol{x}\_i) / V(\mu\_{\widetilde{\zeta}\_n}(\boldsymbol{x}\_i))}\, \mathfrak{d}(Y\_i, \mu\_{\zeta}(\boldsymbol{x}\_i)).$$

This QPMLE is best asymptotically normal, thus, asymptotically optimal within the EDF. There might still be better estimators for *ζ*0, but these are outside the EDF.
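The two-step QPMLE recipe can be sketched in a toy Gaussian setting (a Python/NumPy illustration with all names our own, not the book's R code): a first-stage least-squares fit gives strongly consistent pilot estimators, the residuals give a variance-structure fit, and the second stage re-fits with the estimated dispersion structure as weights.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
zeta0 = 2.0
x = rng.uniform(0.5, 2.0, n)
sigma = 1.0 + x                       # true conditional std dev s_{alpha_0}(x) = 1 + x
Y = zeta0 * x + sigma * rng.standard_normal(n)

# Step 1: PMLE with the Gaussian deviance (plain least squares),
# strongly consistent but not efficient under heteroskedasticity.
zeta_tilde = np.sum(x * Y) / np.sum(x * x)

# Step 1b: estimate the variance structure s_alpha(x) = a + b*x from the
# absolute residuals, using E|N(0, s^2)| = s * sqrt(2/pi).
res = np.abs(Y - zeta_tilde * x) * np.sqrt(np.pi / 2.0)
A = np.column_stack([np.ones(n), x])
a_hat, b_hat = np.linalg.lstsq(A, res, rcond=None)[0]
s2_hat = (a_hat + b_hat * x) ** 2

# Step 2: QPMLE = weighted least squares with weights 1 / s2_hat,
# i.e., deviance loss minimization with the fitted dispersion structure.
w = 1.0 / s2_hat
zeta_qpmle = np.sum(w * x * Y) / np.sum(w * x * x)
print(zeta_tilde, zeta_qpmle)   # both close to 2.0; the weighted fit is more efficient
```

Both estimators are consistent; the point of the second step is the smaller asymptotic variance, in line with the lower bound $\mathcal{H}(\zeta_0)$ of Lemma 11.7.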

If we turn M-estimation into Z-estimation, we obtain the following requirement for $\zeta$, see also (11.5),

$$\frac{1}{n}\sum\_{i=1}^{n} v\_{i}\, \frac{V(\mu\_{\widetilde{\zeta}\_n}(\boldsymbol{x}\_{i}))}{s\_{\widetilde{\alpha}\_n}^{2}(\boldsymbol{x}\_{i})}\, \frac{Y\_{i}-\mu\_{\zeta}(\boldsymbol{x}\_{i})}{V(\mu\_{\zeta}(\boldsymbol{x}\_{i}))}\, \nabla\_{\zeta}\, \mu\_{\zeta}(\boldsymbol{x}\_{i}) \stackrel{!}{=} 0.$$

Thus, it all boils down to finding the right variance structure to obtain the optimal asymptotic behavior.

The previous statements hold true under the following technical assumptions. These are taken from Appendix 1 of Gourieroux et al. [167], and they are an adapted version of the ones in Burguete et al. [61].

#### **Assumption 11.9**


$$\int\_{\mathbb{R}} \sup\_{\mathbf{x}' \in N\_{\mathbf{x}}} b(\varepsilon, \mathbf{x}') \, dp\_{\varepsilon}(\varepsilon | \mathbf{x}) < \infty;$$

*(vi) the functions $\mathfrak{d}(Y, \mu_\zeta(\boldsymbol{x}))$, $\partial\,\mathfrak{d}(Y, \mu_\zeta(\boldsymbol{x}))/\partial \zeta_k$ and $\partial^2 \mathfrak{d}(Y, \mu_\zeta(\boldsymbol{x}))/\partial \zeta_k \partial \zeta_l$ are dominated by $b(\varepsilon, \boldsymbol{x})$.*

## **11.2 Deep Quantile Regression**

So far, in network regression modeling, we have not addressed the question of prediction uncertainty. As mentioned in Remarks 4.2 on forecast evaluation, there are different sources that contribute to prediction uncertainty. There is the model and parameter estimation uncertainty, which may result in an inappropriate model choice, and there is the irreducible risk which comes from the fact that we forecast random variables which inherit a natural randomness that cannot be controlled.

We have discussed methods of evaluating model and parameter estimation error, such as the asymptotic normality of MLEs within GLMs, and we have discussed forecast dominance, the bootstrap method and the nagging predictor that allow one to assess the different sources of prediction uncertainty. However, we have not explicitly quantified these sources of uncertainty within the class of network regression models. We make an attempt in Sect. 11.4, below, by considering the fluctuations generated by bootstrap simulations. The irreducible risk can be assessed once we have a suitable statistical model; in Example 11.4 we have studied a gamma and an inverse Gaussian model on an explicit data set, and these models can be used, e.g., to calculate quantiles. In this section we consider a distribution-free approach that directly estimates these quantiles. Recall from Sect. 5.8.3 that quantiles are elicitable with the pinball loss as a strictly consistent loss function, see Theorem 5.33. This allows us to directly estimate the quantiles from the data.

## *11.2.1 Deep Quantile Regression: Single Quantile*

In this section we present a way of assessing the irreducible risk which does not require a sophisticated model evaluation of distributional assumptions. Quantile regression is increasingly used in the machine learning community because it is a robust way of quantifying the irreducible risk; we refer to Meinshausen [270], Takeuchi et al. [350] and Richman [314]. We recall that quantiles are elicitable having the pinball loss as a strictly consistent loss function, see Theorem 5.33. We define a FN network regression model that allows us to directly estimate the quantiles based on the pinball loss. We therefore use an adapted version of the R code of Listing 9 in Richman [314]; this adapted version has been proposed in Fissler et al. [130] to ensure that different quantiles respect monotonicity. For any two quantile levels $0 < \tau_1 < \tau_2 < 1$ we have

$$F^{-1}(\tau\_1) \le F^{-1}(\tau\_2),\tag{11.17}$$

where $F^{-1}$ denotes the generalized inverse of the distribution function $F$, see (5.80). If we simultaneously learn these quantiles for different quantile levels $\tau_1 < \tau_2$, we need to enforce the network to respect this monotonicity (11.17). This can be achieved by a special network architecture in the output layer, and this is going to be presented in the next section.

We start by considering a single deep *τ* -quantile regression for a quantile level *τ* ∈ *(*0*,* 1*)*. For datum *(Y, x)* we consider the regression function

$$\boldsymbol{x}\ \mapsto\ F\_{Y|\boldsymbol{x}}^{-1}(\tau) = g^{-1}\langle \boldsymbol{\beta}\_\tau, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle,\tag{11.18}$$

for a strictly monotone and smooth link function $g$, output parameter $\boldsymbol{\beta}_\tau \in \mathbb{R}^{q_d+1}$, and where $\boldsymbol{x} \mapsto \boldsymbol{z}^{(d:1)}(\boldsymbol{x})$ is a deep network. We add a lower index $Y|\boldsymbol{x}$ to the generalized inverse $F_{Y|\boldsymbol{x}}^{-1}$ to highlight that we consider the conditional distribution of $Y$, given feature $\boldsymbol{x} \in \mathcal{X}$. In the case of a deep FN network, (11.18) involves a network parameter $\boldsymbol{\vartheta} = (\boldsymbol{w}_1^{(1)}, \ldots, \boldsymbol{w}_{q_d}^{(d)}, \boldsymbol{\beta}_\tau)$ that needs to be estimated. Of course, the deep network architecture $\boldsymbol{x} \mapsto \boldsymbol{z}^{(d:1)}(\boldsymbol{x})$ could also involve any other feature, such as CN or LSTM layers, embedding layers or a NLP text recognition feature. This would change the network architecture, but it would not change anything from a methodological viewpoint.

To estimate this regression parameter *ϑ* from independent data *(Yi, xi)*, 1 ≤ *i* ≤ *n*, we consider the objective function

$$\boldsymbol{\vartheta}\ \mapsto\ \sum\_{i=1}^n L\_{\tau} \left( Y\_{i},\ g^{-1} \langle \boldsymbol{\beta}\_{\tau}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x}\_{i}) \rangle \right),$$

with the strictly consistent pinball loss function *Lτ* for the *τ* -quantile. Alternatively, we could choose any other loss function satisfying Theorem 5.33, and we may try to find the asymptotically optimal one (similarly to Theorem 11.8). We refrain from doing so, but we mention Komunjer–Vuong [222]. Fitting the network parameter *ϑ* is then done in complete analogy to finding an optimal network parameter for network mean modeling. The only change is that we replace the deviance loss function by the pinball loss, e.g., in Listing 7.3 we have to exchange the loss function on line 5 correspondingly.
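As a minimal illustration of why the pinball loss works (a NumPy sketch rather than the book's R/keras code), minimizing $L_\tau$ over a constant prediction recovers the empirical $\tau$-quantile, which is exactly the strict consistency property exploited by the network fit:

```python
import numpy as np

def pinball_loss(y, q, tau):
    """Pinball loss L_tau(y, q); strictly consistent for the tau-quantile."""
    return np.mean(np.maximum(y - q, 0.0) * tau + np.maximum(q - y, 0.0) * (1.0 - tau))

rng = np.random.default_rng(2)
y = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)

tau = 0.9
grid = np.linspace(0.0, 10.0, 1001)
losses = [pinball_loss(y, q, tau) for q in grid]
q_hat = grid[int(np.argmin(losses))]
print(q_hat)   # close to the true 90% quantile exp(1.2816) ~ 3.60
```

In the network setting, the constant $q$ is replaced by the regression function $g^{-1}\langle\boldsymbol{\beta}_\tau, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle$, and the same loss is minimized by SGD.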

## *11.2.2 Deep Quantile Regression: Multiple Quantiles*

We now turn our attention to the multiple quantile case that should satisfy the monotonicity requirement (11.17) for any quantile levels 0 *< τ*<sup>1</sup> *< τ*<sup>2</sup> *<* 1. A separate deep quantile estimation for both quantile levels, as described in the previous section, may violate the monotonicity property, at least, in some part of the feature space *X*, especially if the two quantile levels are close. Therefore, we enforce the monotonicity by a special choice of the network architecture.

For simplicity, in the remainder of this section, we assume that the response $Y$ is positive, a.s. This implies for the quantiles $\tau \mapsto F_{Y|\boldsymbol{x}}^{-1}(\tau) \ge 0$, and we should choose a link function with $g^{-1} \ge 0$ in (11.18). To ensure the monotonicity (11.17) for the quantile levels $0 < \tau_1 < \tau_2 < 1$, we choose a second positive link function with $g_+^{-1} \ge 0$, and we set for multi-task forecasting

$$\boldsymbol{x}\ \mapsto\ \left( F\_{Y|\boldsymbol{x}}^{-1}(\tau\_1),\ F\_{Y|\boldsymbol{x}}^{-1}(\tau\_2) \right)^{\top} \tag{11.19}$$

$$= \left( g^{-1}\langle \boldsymbol{\beta}\_{\tau\_1}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle,\ g^{-1}\langle \boldsymbol{\beta}\_{\tau\_1}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle + g\_{+}^{-1}\langle \boldsymbol{\beta}\_{\tau\_2}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle \right)^{\top} \in \mathbb{R}\_{+}^{2},$$

for a regression parameter $\boldsymbol{\vartheta} = (\boldsymbol{w}_1^{(1)}, \ldots, \boldsymbol{w}_{q_d}^{(d)}, \boldsymbol{\beta}_{\tau_1}, \boldsymbol{\beta}_{\tau_2})$. The positivity $g_+^{-1} \ge 0$ enforces the monotonicity in the two quantiles. We call (11.19) an *additive approach* as we start from a base level characterized by the smaller quantile $F_{Y|\boldsymbol{x}}^{-1}(\tau_1)$, and any bigger quantile is modeled by an additive increment. To ensure monotonicity for multiple quantiles we proceed recursively by choosing the lowest quantile as the initial base level.

We can also consider the upper quantile as the base level by multiplicatively lowering this upper quantile. Choose the (sigmoid) function $g_\sigma^{-1} \in (0,1)$ and set for the *multiplicative approach*

$$\boldsymbol{x}\ \mapsto\ \left( F\_{Y|\boldsymbol{x}}^{-1}(\tau\_1),\ F\_{Y|\boldsymbol{x}}^{-1}(\tau\_2) \right)^{\top} \tag{11.20}$$

$$= \left( g\_{\sigma}^{-1}\langle \boldsymbol{\beta}\_{\tau\_1}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle\, g^{-1}\langle \boldsymbol{\beta}\_{\tau\_2}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle,\ g^{-1}\langle \boldsymbol{\beta}\_{\tau\_2}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle \right)^{\top} \in \mathbb{R}\_{+}^{2}.$$
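A small NumPy sketch (with stand-in values for the network outputs $\langle\boldsymbol{\beta}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle$, not a fitted model) verifies that both constructions are monotone by design, whatever the network produces:

```python
import numpy as np

rng = np.random.default_rng(3)
# stand-ins for the last-layer outputs <beta, z(x)> on 1000 features
eta1, eta2 = rng.standard_normal((2, 1000))

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# additive approach (11.19): positive base level plus a positive increment
q1_add = np.exp(eta1)            # g^{-1} = exp >= 0
q2_add = q1_add + np.exp(eta2)   # g_+^{-1} = exp >= 0

# multiplicative approach (11.20): upper base level times a factor in (0,1)
q2_mul = np.exp(eta2)
q1_mul = sigmoid(eta1) * q2_mul  # g_sigma^{-1} = sigmoid in (0,1)

assert np.all(q1_add <= q2_add) and np.all(q1_mul <= q2_mul)
print("monotonicity holds on all features")
```

This mirrors the exponential and sigmoid output activations used in Listings 11.3 and 11.4.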

*Remark 11.10* In (11.19) and (11.20) we directly enforce the monotonicity by a corresponding regression function choice. Alternatively, we can also design a (plain-vanilla) multi-output network

$$\boldsymbol{x}\ \mapsto\ \left( F\_{Y|\boldsymbol{x}}^{-1}(\tau\_1),\ F\_{Y|\boldsymbol{x}}^{-1}(\tau\_2) \right)^{\top} \tag{11.21}$$

$$= \left( g^{-1}\langle \boldsymbol{\beta}\_{\tau\_1}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle,\ g^{-1}\langle \boldsymbol{\beta}\_{\tau\_2}, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle \right)^{\top} \in \mathbb{R}\_+^2.$$

If we just use a classical SGD fitting algorithm, we will likely end up in a situation where the monotonicity is violated in some part of the feature space. Kellner et al. [211] consider this problem. They add a penalization (regularization term) that punishes, during SGD training, network parameters that violate the monotonicity. Such a penalization can be constructed, e.g., with the ReLU function.

## *11.2.3 Lab: Deep Quantile Regression*

We revisit the Swiss accident insurance data of Sect. 11.1.2, and we provide an example of a deep quantile regression using both the additive approach (11.19) and the multiplicative approach (11.20).

We select 5 different quantile levels $\mathcal{Q} = (\tau_1, \tau_2, \tau_3, \tau_4, \tau_5) = (10\%, 25\%, 50\%, 75\%, 90\%)$. We start with the additive approach (11.19). It requires setting $\tau_1 = 10\%$ as the base level, and the remaining quantile levels are modeled additively in a recursive way for $\tau_j < \tau_{j+1}$, $1 \le j \le 4$. The corresponding R code is given on lines 8–20 of Listing 11.3, and this compiles to the 5-dimensional output on line 22. For the multiplicative approach (11.20) we set $\tau_5 = 90\%$ as the base level, and the remaining quantile levels are received multiplicatively in a recursive way for $\tau_{j+1} > \tau_j$, $4 \ge j \ge 1$, see Listing 11.4. The additive and the multiplicative approaches take the extreme quantiles as initialization. One may also be interested in initializing the model in the median $\tau_3 = 50\%$; the smaller quantiles can then be received by the multiplicative approach and the bigger quantiles by the additive approach. We also explore this case and we call it the *mixed approach*.

**Listing 11.3** Multiple FN quantile regression: additive approach

```
1 Design = layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
2 #
3 Network = Design %>%
4 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
5 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
6 layer_dense(units=10, activation='tanh', name='FNLayer3')
7 #
8 q1 = Network %>% layer_dense(units=1, activation='exponential')
9 #
10 q20 = Network %>% layer_dense(units=1, activation='exponential')
11 q2 = list(q1,q20) %>% layer_add()
12 #
13 q30 = Network %>% layer_dense(units=1, activation='exponential')
14 q3 = list(q2,q30) %>% layer_add()
15 #
16 q40 = Network %>% layer_dense(units=1, activation='exponential')
17 q4 = list(q3,q40) %>% layer_add()
18 #
19 q50 = Network %>% layer_dense(units=1, activation='exponential')
20 q5 = list(q4,q50) %>% layer_add()
21 #
22 model = keras_model(inputs = list(Design), outputs = c(q1,q2,q3,q4,q5))
```
**Listing 11.4** Multiple FN quantile regression: multiplicative approach

```
1 q5 = Network %>% layer_dense(units=1, activation='exponential')
2 #
3 q40 = Network %>% layer_dense(units=1, activation='sigmoid')
4 q4 = list(q5,q40) %>% layer_multiply()
5 #
6 q30 = Network %>% layer_dense(units=1, activation='sigmoid')
7 q3 = list(q4,q30) %>% layer_multiply()
8 #
9 q20 = Network %>% layer_dense(units=1, activation='sigmoid')
10 q2 = list(q3,q20) %>% layer_multiply()
11 #
12 q10 = Network %>% layer_dense(units=1, activation='sigmoid')
13 q1 = list(q2,q10) %>% layer_multiply()
```
**Listing 11.5** Fitting a multiple FN quantile regression

```
1 Q_loss1 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.1
2 + k_maximum(y_pred - y_true, 0) * (1 - 0.1))}
3 Q_loss2 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.25
4 + k_maximum(y_pred - y_true, 0) * (1 - 0.25))}
5 Q_loss3 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.5
6 + k_maximum(y_pred - y_true, 0) * (1 - 0.5))}
7 Q_loss4 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.75
8 + k_maximum(y_pred - y_true, 0) * (1 - 0.75))}
9 Q_loss5 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.9
10 + k_maximum(y_pred - y_true, 0) * (1 - 0.9))}
11 #
12 model %>% compile(loss = list(Q_loss1,Q_loss2,Q_loss3,Q_loss4,Q_loss5),
13 optimizer = 'nadam')
```
These network architectures are fitted to the data using the pinball loss (5.81) for the quantile levels of $\mathcal{Q}$; note that the pinball loss requires the assumption of a finite first moment. Listing 11.5 shows the choice of the pinball loss functions. We then fit the three architectures (additive, multiplicative and mixed) to our learning data $\mathcal{L}$, and we apply early stopping to prevent over-fitting. Moreover, we consider the nagging predictor over 20 runs with different seeds to reduce the randomness coming from SGD fitting.

In Table 11.6 we give the out-of-sample pinball losses on the test data $\mathcal{T}$ of the three considered approaches, for the 5 quantile levels of $\mathcal{Q}$. The losses of the three approaches are rather close, giving a slight preference to the mixed approach, but the other two approaches seem to be competitive, too. We further analyze these quantile regression models by considering the empirical coverage ratios defined by

$$\widehat{\tau}\_{j} = \frac{1}{T} \sum\_{t=1}^{T} \mathbb{1}\_{\left\{ Y\_{t}^{\dagger} \le \widehat{F}\_{Y|\boldsymbol{x}\_{t}^{\dagger}}^{-1}(\tau\_{j}) \right\}},\tag{11.22}$$

where $\widehat{F}_{Y|\boldsymbol{x}_t^\dagger}^{-1}(\tau_j)$ is the estimated quantile for level $\tau_j$ and feature $\boldsymbol{x}_t^\dagger$. Remark that the coverage ratios (11.22) correspond to the identification functions that are essentially the derivatives of the pinball losses; we refer to Dimitriadis et al. [106]. Table 11.7 reports these out-of-sample coverage ratios on the test data $\mathcal{T}$. From these results we conclude that on the portfolio level the quantiles are matched rather well.
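The coverage ratio (11.22) is straightforward to compute; the following Python sketch (our own toy example, not the Swiss accident data) checks it on a perfectly specified exponential quantile model, where the empirical coverage should reproduce the nominal level:

```python
import numpy as np

def coverage_ratio(y_test, q_hat):
    """Empirical coverage (11.22): fraction of test observations below the
    estimated quantile; close to tau_j for a well-calibrated quantile model."""
    return float(np.mean(y_test <= q_hat))

rng = np.random.default_rng(4)
y = rng.exponential(scale=2.0, size=100_000)

# perfectly specified quantile 'model': the true exponential tau-quantile
tau = 0.75
q_true = -2.0 * np.log(1.0 - tau)
cov = coverage_ratio(y, np.full_like(y, q_true))
print(cov)   # close to 0.75
```

In practice `q_hat` would be the feature-dependent network output $\widehat{F}_{Y|\boldsymbol{x}_t^\dagger}^{-1}(\tau_j)$ evaluated on the test features.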

In Fig. 11.8 we illustrate the estimated out-of-sample quantiles $\widehat{F}_{Y|\boldsymbol{x}_t^\dagger}^{-1}(\tau_j)$ for individual claims on the quantile levels $\tau_j \in \{10\%, 25\%, 50\%, 75\%, 90\%\}$ (cyan, blue, black, blue, cyan colors) using the mixed approach. The $x$-axis considers the logged estimated medians $\widehat{F}_{Y|\boldsymbol{x}_t^\dagger}^{-1}(50\%)$. We observe heteroskedasticity resulting in quantiles that are not ordered w.r.t. the median (black line). This supports the multiple deep quantile regression model because we cannot (simply) extrapolate the median to receive the other quantiles.

In the final step we compare the estimated quantiles $\widehat{F}_{Y|\boldsymbol{x}}^{-1}(\tau_j)$ from the mixed deep quantile regression approach to the ones that can be calculated from the fitted inverse Gaussian model using the double FN network approach of Example 11.4. In the latter model we estimate the mean $\mu(\boldsymbol{x})$ and the dispersion $\varphi(\boldsymbol{x})$ with two FN networks, which then allow us to calculate the quantiles using the inverse Gaussian distributional assumption. Note that we cannot calculate the quantiles in Tweedie's family with power variance parameter $p = 2.5$ because there is no


**Table 11.6** Out-of-sample pinball losses of quantile regressions using the additive, the multiplicative and the mixed approaches; nagging predictors over 20 different seeds


**Table 11.7** Out-of-sample coverage ratios $\widehat{\tau}_j$ below the estimated deep FN quantile estimates $\widehat{F}_{Y|\boldsymbol{x}^\dagger}^{-1}(\tau_j)$

#### **Fig. 11.8** Estimated out-of-sample quantiles $\widehat{F}_{Y|\boldsymbol{x}_t^\dagger}^{-1}(\tau_j)$ of 2'000 randomly selected individual claims on the quantile levels $\tau_j \in \{10\%, 25\%, 50\%, 75\%, 90\%\}$ (cyan, blue, black, blue, cyan colors) using the mixed approach; the red dots are the out-of-sample observations $Y_t^\dagger$; the $x$-axis gives $\log \widehat{F}_{Y|\boldsymbol{x}_t^\dagger}^{-1}(50\%)$ (also corresponding to the black quantiles on individual claims)

closed form of the distribution function. Figure 11.9 compares the two approaches on the quantile levels of $\mathcal{Q}$. Overall we observe a reasonably good match, though it is not perfect. The small quantiles for level $\tau_1 = 10\%$ seem slightly under-estimated by the inverse Gaussian approach (see Fig. 11.9 (top-left)), whereas the big quantiles $\tau_4 = 75\%$ and $\tau_5 = 90\%$ seem more conservative in the inverse Gaussian approach (see Fig. 11.9 (bottom)). This may indicate that the inverse Gaussian distribution does not fully fit the data, i.e., that one cannot fully recover the true quantiles from the mean $\mu(\boldsymbol{x})$, the dispersion $\varphi(\boldsymbol{x})$ and an inverse Gaussian assumption. There are two ways to further explore these issues. One can either choose other distributional assumptions which may better match the properties of the data; this further explores the distributional approach. Alternatively, Theorem 5.33 allows us to choose loss functions different from the pinball loss, i.e., one could consider different increasing functions $G$ in that theorem to further explore the distribution-free approach. In general, any increasing choice of the function $G$ leads to a strictly consistent quantile estimation (this is an asymptotic statement), but these choices may have different finite sample properties. Following Komunjer–Vuong [222], we can determine asymptotically efficient choices for $G$. This would require feature-dependent choices $G_{\boldsymbol{x}_i}(y) = F_{Y|\boldsymbol{x}_i}(y)$, where $F_{Y|\boldsymbol{x}_i}$ is the (true) distribution of $Y_i$, conditionally given $\boldsymbol{x}_i$. This requires the knowledge of the true distribution, and Komunjer–Vuong [222] derive asymptotic efficiency when replacing this true

**Fig. 11.9** Inverse Gaussian quantiles vs. deep quantile regression estimates of 2'000 randomly selected claims on the quantile levels of *Q* = *(*10%*,* 25%*,* 50%*,* 75%*,* 90%*)*

distribution by a non-parametric estimator, this is in spirit similar to Theorem 11.8. We refrain from giving more details but refer to the corresponding paper.

## **11.3 Deep Composite Model Regression**

We have established a deep quantile regression in the previous section. Next we jointly estimate quantiles and conditional tail expectations (CTEs), leading to a composite regression model that has a splicing point determined by a quantile level; for composite models we refer to Sect. 6.4.4. This is exactly the proposal of Fissler et al. [130] which we are going to present in this section. Note that having a composite model allows us to have different distributions and regression structures below and above the splicing point, e.g., we can have a more heavy-tailed model in the upper tail using a different feature engineering from the main body of the data.

## *11.3.1 Joint Elicitability of Quantiles and Expected Shortfalls*

In the previous examples we have seen that the distributional models may misestimate the true tail of the data because model fitting often pays more attention to an accurate model fit in the main body of the data. An idea is to directly estimate this tail in a distribution-free way by considering the (upper) CTE

$$\mathrm{CTE}\_{\tau}^{+}(Y|\boldsymbol{x}) = \mathbb{E}\left[Y \,\middle|\, Y > F\_{Y|\boldsymbol{x}}^{-1}(\tau),\ \boldsymbol{x} \right],\tag{11.23}$$

for a given quantile level *τ* ∈ *(*0*,* 1*)*. The problem with (11.23) is that this is not an elicitable quantity, i.e., there is no loss/scoring function that is strictly consistent for the CTE functional.

If the distribution function *FY* <sup>|</sup>*<sup>x</sup>* is continuous, we can rewrite the upper CTE as follows, see Lemma 2.16 in McNeil et al. [268] and (11.35) below,

$$\mathrm{CTE}\_{\tau}^{+}(Y|\boldsymbol{x}) = \mathrm{ES}\_{\tau}^{+}(Y|\boldsymbol{x}) = \frac{1}{1-\tau} \int\_{\tau}^{1} F\_{Y|\boldsymbol{x}}^{-1}(p) \, dp \ \ge\ F\_{Y|\boldsymbol{x}}^{-1}(\tau).\tag{11.24}$$

This second object $\mathrm{ES}\_{\tau}^{+}(Y|\boldsymbol{x})$ is called the upper expected shortfall (ES) of $Y$, given $\boldsymbol{x}$, on the security level $\tau$. Fissler–Ziegel [131] and Fissler et al. [132] have proved that $\mathrm{ES}\_{\tau}^{+}(Y|\boldsymbol{x})$ is *jointly* elicitable with the $\tau$-quantile $F\_{Y|\boldsymbol{x}}^{-1}(\tau)$. That is, there is a strictly consistent bivariate loss function that allows one to jointly estimate the $\tau$-quantile and the corresponding ES. In fact, Corollary 5.5 of Fissler–Ziegel [131] gives the full characterization of the strictly consistent bivariate loss functions for the joint elicitability of the $\tau$-quantile and the ES; note that Fissler–Ziegel [131] use a different sign convention. This result is used in Guillén et al. [175] for the joint estimation of the quantile and the ES within a GLM, using a two-step fitting approach.

Fissler et al. [130] extend the results of Fissler–Ziegel [131], allowing for the joint estimation of the *composite triplet* consisting of the lower ES, the $\tau$-quantile and the upper ES. This gives us a composite model that has the $\tau$-quantile as splicing point. The beauty of this approach is that we can fit (in one step) a deep learning model to the lower and the upper ES, and perform a (potentially different) regression in the two parts of the distribution. The lower CTE and the lower ES are defined by, respectively,

$$\text{CTE}\_{\tau}^{-}(Y|\boldsymbol{x}) = \mathbb{E}\left[Y \, \middle| \, Y \le F\_{Y|\boldsymbol{x}}^{-1}(\tau), \, \boldsymbol{x} \right],$$

and

$$\mathrm{ES}\_{\tau}^{-}(Y|\boldsymbol{x}) = \frac{1}{\tau} \int\_{0}^{\tau} F\_{Y|\boldsymbol{x}}^{-1}(p) \, dp \,\le\, F\_{Y|\boldsymbol{x}}^{-1}(\tau).$$

Again, in the case of a continuous distribution function $F\_{Y|\boldsymbol{x}}$ we have the identity $\mathrm{CTE}\_{\tau}^{-}(Y|\boldsymbol{x}) = \mathrm{ES}\_{\tau}^{-}(Y|\boldsymbol{x})$. From the lower and upper CTEs we recover the mean of $Y$, given $\boldsymbol{x}$, by

$$\mu(\boldsymbol{x}) = \mathbb{E}[Y|\boldsymbol{x}] = \tau \,\mathrm{CTE}\_{\tau}^{-}(Y|\boldsymbol{x}) + (1-\tau)\,\mathrm{CTE}\_{\tau}^{+}(Y|\boldsymbol{x}).\tag{11.25}$$
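Identity (11.25) is easy to verify on simulated data. The following Python sketch is ours (the listings in this book use R/keras); it estimates the composite triplet empirically for a gamma sample and recovers the sample mean:

```python
import numpy as np

def empirical_composite_triplet(y, tau):
    """Empirical tau-quantile and lower/upper CTEs of a sample; for a
    continuous distribution these CTEs agree with ES^- and ES^+, see (11.24)."""
    q = np.quantile(y, tau)
    cte_minus = y[y <= q].mean()   # lower CTE
    cte_plus = y[y > q].mean()     # upper CTE
    return cte_minus, q, cte_plus

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.5, size=200_000)
tau = 0.9
cte_m, q, cte_p = empirical_composite_triplet(y, tau)
# (11.25): the mean is recovered from the lower and upper CTEs
mu = tau * cte_m + (1 - tau) * cte_p
```

Up to the usual sampling error, `mu` coincides with the sample mean `y.mean()`.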

We introduce the auxiliary scoring functions

$$\begin{aligned} S\_{\tau}^{-}(y,a) &= \left(\mathbb{1}\_{\{y\leq a\}}-\tau\right)a - \mathbb{1}\_{\{y\leq a\}}\,y, \\ S\_{\tau}^{+}(y,a) &= \left(1-\tau-\mathbb{1}\_{\{y>a\}}\right)a + \mathbb{1}\_{\{y>a\}}\,y = S\_{\tau}^{-}(y,a) + y, \end{aligned}$$

for $y, a \in \mathbb{R}$ and for $\tau \in (0, 1)$. These auxiliary functions consider only the part of the pinball loss (5.81) that depends on the action $a$, and we get the pinball loss as follows

$$L\_{\tau}(y, a) = S\_{\tau}^{-}(y, a) + \tau y = S\_{\tau}^{+}(y, a) - (1 - \tau)\, y.$$

Therefore, all three functions provide strictly consistent scoring functions for the $\tau$-quantile, but only the pinball loss satisfies the calibration property (L0) on page 92.
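For concreteness, these scoring functions can be written out directly; the following Python sketch (in our notation) verifies the identities above against the standard form of the pinball loss, $L\_\tau(y,a) = \tau(y-a)\_+ + (1-\tau)(a-y)\_+$:

```python
import numpy as np

def S_minus(y, a, tau):
    # auxiliary score: only the part of the pinball loss depending on the action a
    ind = (y <= a).astype(float)
    return (ind - tau) * a - ind * y

def S_plus(y, a, tau):
    # S^+ differs from S^- by the action-independent term y
    return S_minus(y, a, tau) + y

def pinball(y, a, tau):
    # L_tau(y, a) = S^-(y, a) + tau*y = S^+(y, a) - (1 - tau)*y
    return S_minus(y, a, tau) + tau * y
```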

For the following theorem we recall the general definition of the $\tau$-quantile $Q\_\tau(F\_{Y|\boldsymbol{x}})$ of a distribution function $F\_{Y|\boldsymbol{x}}$, see (5.82).

**Theorem 11.11 (Theorem 2.8 of Fissler et al. [130], Without Proof)** *Choose $\tau \in (0, 1)$ and let $\mathcal{F}$ contain only distributions with a finite first moment that are supported in the interval $\mathfrak{C} \subseteq \mathbb{R}$. The loss function $L: \mathfrak{C} \times \mathfrak{C}^3 \to \mathbb{R}\_+$ of the form*

$$L(y; e^-, q, e^+) = (G(y) - G(q)) \left( \tau - \mathbb{1}\_{\{y \le q\}} \right) \tag{11.26}$$

$$+ \left\langle \nabla \Psi(e^-, e^+), \begin{pmatrix} e^- + \frac{1}{\tau} S^-\_{\tau}(y, q) \\ e^+ - \frac{1}{1-\tau} S^+\_{\tau}(y, q) \end{pmatrix} \right\rangle - \Psi(e^-, e^+) + \Psi(y, y),$$

*is strictly consistent for the composite triplet $(\mathrm{ES}\_{\tau}^{-}, Q\_\tau, \mathrm{ES}\_{\tau}^{+})$ relative to the class $\mathcal{F}$, if $\Psi$ is strictly convex with (sub-)gradient $\nabla\Psi$ such that for all $(e^-, e^+) \in \mathfrak{C}^2$ the function*

$$q \mapsto G\_{e^-,e^+}(q) = G(q) + \frac{1}{\tau} \frac{\partial}{\partial e^-} \Psi(e^-, e^+) q - \frac{1}{1 - \tau} \frac{\partial}{\partial e^+} \Psi(e^-, e^+) q,\tag{11.27}$$

*is strictly increasing, and if $\mathbb{E}\_{F}[|G(Y)|] < \infty$ and $\mathbb{E}\_{F}[|\Psi(Y, Y)|] < \infty$ for all $Y \sim F \in \mathcal{F}$.*

This opens the door for regression modeling of CTEs for continuous distribution functions $F\_{Y|\boldsymbol{x}}$, $\boldsymbol{x} \in \mathcal{X}$. Namely, we can choose a regression function $\xi\_{\vartheta}$ with a three-dimensional output

$$
\boldsymbol{x} \in \mathcal{X} \mapsto \xi\_{\vartheta}(\boldsymbol{x}) \in \mathfrak{C}^3,
$$

depending on a regression parameter $\vartheta$. This regression function is now used to describe the composite triplet $(\mathrm{ES}\_{\tau}^{-}(Y|\boldsymbol{x}), F\_{Y|\boldsymbol{x}}^{-1}(\tau), \mathrm{ES}\_{\tau}^{+}(Y|\boldsymbol{x}))$. Having i.i.d. data $(Y\_i, \boldsymbol{x}\_i)$, $1 \le i \le n$, it can be fitted by solving

$$\widehat{\vartheta} = \operatorname\*{arg\,min}\_{\vartheta}\, \frac{1}{n} \sum\_{i=1}^{n} L\left(Y\_i; \xi\_{\vartheta}(\boldsymbol{x}\_i)\right), \tag{11.28}$$

with loss function *L* given by (11.26). This then provides us with the estimates for the composite triplet

$$\boldsymbol{x} \mapsto \xi\_{\widehat{\vartheta}}(\boldsymbol{x}) = \left( \widehat{\mathrm{ES}}\_{\tau}^{-}(Y|\boldsymbol{x}),\, \widehat{F}\_{Y|\boldsymbol{x}}^{-1}(\tau),\, \widehat{\mathrm{ES}}\_{\tau}^{+}(Y|\boldsymbol{x}) \right).$$

There remains the choice of the functions $G$ and $\Psi$, such that $\Psi$ is strictly convex and $G\_{e^-,e^+}$, defined in (11.27), is strictly increasing. Section 2.3 in Fissler et al. [130] discusses possible choices. A simple choice is to select the identity function $G(y) = y$ (which gives the pinball loss on the first line of (11.26)) and

$$
\Psi(e^-, e^+) = \psi\_1(e^-) + \psi\_2(e^+),
$$

with $\psi\_1$ and $\psi\_2$ strictly convex and with (sub-)gradients $\psi\_1' > 0$ and $\psi\_2' < 0$. Inserting this choice into (11.26) provides the loss function

$$L(y; e^-, q, e^+) = \left[1 + \frac{\psi\_1'(e^-)}{\tau} + \frac{-\psi\_2'(e^+)}{1 - \tau}\right] L\_{\tau}(y, q) + D\_{\psi\_1}(y, e^-) + D\_{\psi\_2}(y, e^+), \tag{11.29}$$

where $L\_{\tau}(y, q)$ is the pinball loss (5.81) and $D\_{\psi\_1}$ and $D\_{\psi\_2}$ are Bregman divergences (2.28). There remain the choices of $\psi\_1$ and $\psi\_2$, which should be strictly convex, the first one strictly increasing and the second one strictly decreasing.

We restrict ourselves to strictly convex functions $\psi$ on the positive real line $\mathbb{R}\_+$, i.e., to positive claims $Y > 0$, a.s. For $b \in \mathbb{R}$, we consider the following functions on $\mathbb{R}\_+$

$$\boldsymbol{\psi}^{(b)}(\mathbf{y}) = \begin{cases} \frac{1}{b(b-1)} \mathbf{y}^b & \text{for } b \neq 0 \text{ and } b \neq 1, \\ -1 - \log(\mathbf{y}) & \text{for } b = 0, \\ \mathbf{y} \log(\mathbf{y}) - \mathbf{y} & \text{for } b = 1. \end{cases} \tag{11.30}$$

We compute the first and second derivatives. These are for *y >* 0 given by

$$\frac{\partial}{\partial \mathbf{y}} \boldsymbol{\psi}^{(b)}(\mathbf{y}) = \begin{cases} \frac{1}{b-1} \mathbf{y}^{b-1} & \text{for } b \neq 1, \\ \log(\mathbf{y}) & \text{for } b = 1, \end{cases} \qquad \text{and} \qquad \frac{\partial^2}{\partial \mathbf{y}^2} \boldsymbol{\psi}^{(b)}(\mathbf{y}) = \mathbf{y}^{b-2} > 0.$$

Thus, for any $b \in \mathbb{R}$ we have a convex function, and this convex function is decreasing on $\mathbb{R}\_+$ for $b < 1$ and increasing for $b > 1$. Therefore, we have to select $b > 1$ for $\psi\_1$ and $b < 1$ for $\psi\_2$ to get suitable choices in (11.29). Interestingly, these choices correspond to Lemma 11.2 with power variance parameters $p = 2 - b$, i.e., they provide us with Bregman divergences from Tweedie's distributions. However, (11.30) is more general, because it allows us to select any $b \in \mathbb{R}$, whereas for power variance parameters $p \in (0, 1)$ there do not exist any Tweedie's distributions, see Theorem 2.18.
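In Python (a sketch in our notation), the functions (11.30) and their first derivatives read as follows; the sign of $\psi'$, and hence the monotonicity in $b$, can then be checked numerically.

```python
import numpy as np

def psi(y, b):
    """Strictly convex functions psi^{(b)} on R_+ from (11.30); b is any real."""
    if b == 0:
        return -1 - np.log(y)
    if b == 1:
        return y * np.log(y) - y
    return y**b / (b * (b - 1))

def psi_prime(y, b):
    """First derivative: y^{b-1}/(b-1) for b != 1, and log(y) for b = 1."""
    return np.log(y) if b == 1 else y**(b - 1) / (b - 1)
```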

In view of Lemma 11.2, and using the fact that the unit deviances $\mathfrak{d}\_p$ are Bregman divergences, we select a power variance parameter $p = 2 - b > 1$ for $\psi\_2$, and we select the Gaussian model $p = 2 - b = 0$ for $\psi\_1$. This gives us the following special choice of the loss function (11.29) for strictly positive claims $Y > 0$, a.s.,

$$L(y; e^-, q, e^+) = \left[1 + \frac{\eta\_1 e^-}{\tau} + \frac{\eta\_2 (e^+)^{1-p}}{(1-\tau)(p-1)}\right] L\_{\tau}(y, q) + \frac{\eta\_1}{2}\, \mathfrak{d}\_0(y, e^-) + \frac{\eta\_2}{2}\, \mathfrak{d}\_p(y, e^+), \tag{11.31}$$

with the Gaussian unit deviance $\mathfrak{d}\_0(y, e^-) = (y - e^-)^2$ and Tweedie's unit deviance $\mathfrak{d}\_p$ with power variance parameter $p > 1$, see Sect. 11.1.1. The additional constants $\eta\_1, \eta\_2 > 0$ are used to balance the contributions of the individual terms to the total loss. Typically, we choose $p \ge 2$ for the upper ES, reflecting claim size models. This choice of $\psi\_2$ implies that the residuals are weighted inversely proportional to the corresponding variances $\mu^p$ within Tweedie's family, see (11.5). Using the loss function (11.31) in (11.28) allows us to estimate the composite triplet $(\mathrm{ES}\_{\tau}^{-}(Y|\boldsymbol{x}), F\_{Y|\boldsymbol{x}}^{-1}(\tau), \mathrm{ES}\_{\tau}^{+}(Y|\boldsymbol{x}))$ with a strictly consistent loss function.
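For illustration, the loss (11.31) can be written generically as follows. This Python sketch is ours (the book's implementation is in R/keras), with the gamma ($p = 2$), inverse Gaussian ($p = 3$) and general Tweedie ($p > 1$) unit deviances spelled out:

```python
import numpy as np

def composite_loss(y, e_minus, q, e_plus, tau, p, eta1=1.0, eta2=1.0):
    """Strictly consistent loss (11.31) for the composite triplet
    (ES^-, tau-quantile, ES^+) on positive claims y > 0."""
    pinball = tau * np.maximum(y - q, 0) + (1 - tau) * np.maximum(q - y, 0)
    weight = 1 + eta1 * e_minus / tau + eta2 * e_plus**(1 - p) / ((1 - tau) * (p - 1))
    d0 = (y - e_minus)**2                              # Gaussian unit deviance
    if p == 2:                                         # gamma unit deviance
        dp = 2 * (y / e_plus - 1 - np.log(y / e_plus))
    elif p == 3:                                       # inverse Gaussian unit deviance
        dp = (y - e_plus)**2 / (e_plus**2 * y)
    else:                                              # general Tweedie, p > 1, p != 2
        dp = 2 * (y**(2 - p) / ((1 - p) * (2 - p))
                  - y * e_plus**(1 - p) / (1 - p)
                  + e_plus**(2 - p) / (2 - p))
    return np.mean(weight * pinball + eta1 / 2 * d0 + eta2 / 2 * dp)
```

The loss vanishes at a perfect fit and is strictly positive otherwise, which is the defining property of a strictly consistent scoring function evaluated on a single observation.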

## *11.3.2 Lab: Deep Composite Model Regression*

The joint elicitability of Theorem 11.11 allows us to directly estimate these functionals for a fixed quantile level $\tau \in (0, 1)$. In a similar way to quantile regression, we set up a FN network that respects the monotonicity $\mathrm{ES}\_{\tau}^{-}(Y|\boldsymbol{x}) \le F\_{Y|\boldsymbol{x}}^{-1}(\tau) \le \mathrm{ES}\_{\tau}^{+}(Y|\boldsymbol{x})$. For the regression function we set, in the additive approach for multi-task learning,

$$\boldsymbol{x} \mapsto \left( \mathrm{ES}\_{\tau}^{-}(Y|\boldsymbol{x}),\, F\_{Y|\boldsymbol{x}}^{-1}(\tau),\, \mathrm{ES}\_{\tau}^{+}(Y|\boldsymbol{x}) \right)^\top = \begin{pmatrix} g^{-1}\langle \beta\_1, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle \\ g^{-1}\langle \beta\_1, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle + g\_+^{-1}\langle \beta\_2, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle \\ g^{-1}\langle \beta\_1, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle + g\_+^{-1}\langle \beta\_2, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle + g\_+^{-1}\langle \beta\_3, \boldsymbol{z}^{(d:1)}(\boldsymbol{x})\rangle \end{pmatrix} \in \mathcal{A}, \tag{11.32}$$

for link functions $g$ and $g\_+$ with $g\_+^{-1} \ge 0$, deep FN network $\boldsymbol{z}^{(d:1)}: \mathbb{R}^{q\_0+1} \to \mathbb{R}^{q\_d+1}$, regression parameters $\beta\_1, \beta\_2, \beta\_3 \in \mathbb{R}^{q\_d+1}$, and with the action space $\mathcal{A} = \{(e^-, q, e^+) \in \mathbb{R}\_+^3;\, e^- \le q \le e^+\}$ for positive claims. We also recall Remark 11.10 for a different way of modeling the monotonicity.

Fitting this model is similar to the multiple deep quantile regression presented in Listings 11.3 and 11.5. There is one important difference though: we do not have multiple outputs with multiple loss functions, but a three-dimensional output with a single loss function (11.31) that simultaneously evaluates all three components of the output (11.32). Listing 11.6 gives this loss for the inverse Gaussian case $p = 3$ in (11.31).

**Listing 11.6** Loss function (11.31) for *p* = 3

```
Bregman_IG = function(y_true, y_pred){
  # pinball loss L_tau(y, q) with q = y_pred[,2], multiplied by the weight
  # [1 + eta1*e^-/tau + eta2*(e^+)^(-2)/(2*(1-tau))], see (11.31) for p=3,
  # plus the Gaussian and inverse Gaussian unit deviance terms
  k_mean( (k_maximum(y_true[,1]-y_pred[,2],0)*tau0 +
           k_maximum(y_pred[,2]-y_true[,1],0)*(1-tau0) ) *
          ( 1 + eta1*y_pred[,1]/tau0 + eta2*y_pred[,3]^(-2)/(2*(1-tau0)) ) +
          eta1*(y_true[,1]-y_pred[,1])^2/2 +
          eta2*((y_true[,1]-y_pred[,3])^2/(y_pred[,3]^2*y_true[,1]))/2 )}
```
We revisit the Swiss accident insurance data of Sect. 11.2.3. We again use a FN network of depth $d = 3$ with $(q\_1, q\_2, q\_3) = (20, 15, 10)$ neurons, hyperbolic tangent activation, two-dimensional embedding layers for the categorical features, exponential output activations for $g^{-1}$ and $g\_+^{-1}$, and the additive structure (11.32). We implement the loss function (11.31) for the quantile level $\tau = 90\%$ and with power variance parameter $p = 3$, see Listing 11.6. This implies that for the upper ES estimation we scale the residuals with $V(\mu) = \mu^3$, see (11.5). We then run an initial calibration of this FN network. Based on this initial calibration we can calculate the three loss contributions in (11.31) coming from the composite triplet. Based on these figures we choose the constants $\eta\_1, \eta\_2 > 0$ in (11.31) so that all three terms of the composite triplet contribute equally to the total loss. For the remainder of our calibration we hold on to these choices of $\eta\_1$ and $\eta\_2$.

We calibrate this deep FN architecture to the learning data $\mathcal{L}$, using the strictly consistent loss function (11.31) for the composite triplet $(\mathrm{ES}\_{90\%}^{-}(Y|\boldsymbol{x}), F\_{Y|\boldsymbol{x}}^{-1}(90\%), \mathrm{ES}\_{90\%}^{+}(Y|\boldsymbol{x}))$, and to reduce the randomness in prediction we average over 20 early stopped SGD calibrations with different seeds (nagging predictor).

Figure 11.10 shows the estimated lower and upper ES against the corresponding 90%-quantile estimates for 2'000 randomly selected insurance claims $\boldsymbol{x}\_t^{\dagger}$. The diagonal orange line shows the estimated 90%-quantiles $\widehat{F}\_{Y|\boldsymbol{x}\_t^{\dagger}}^{-1}(90\%)$, and the cyan lines give spline fits to the estimated lower and upper ES. It is clearly visible that these estimates respect the ordering

$$\widehat{\operatorname{ES}}\_{90\%}^{-}(Y|\boldsymbol{x}\_{t}^{\dagger}) \leq \widehat{F}\_{Y|\boldsymbol{x}\_{t}^{\dagger}}^{-1}(90\%) \leq \widehat{\operatorname{ES}}\_{90\%}^{+}(Y|\boldsymbol{x}\_{t}^{\dagger}),$$

for fixed features $\boldsymbol{x}\_t^{\dagger} \in \mathcal{X}$.

The deep quantile regression has been back-tested using the coverage ratios (11.22). Back-testing the ES is more difficult: the standalone ES is not elicitable, and the ES can only be back-tested jointly with the corresponding quantile. The part of the joint identification function that corresponds to the ES is given by, see (4.2)–(4.3) in Fissler et al. [130],

$$\widehat{v}\_{-} = \frac{1}{T} \sum\_{t=1}^{T} \widehat{\operatorname{ES}}\_{\tau}^{-}(Y|\boldsymbol{x}\_{t}^{\dagger}) - \frac{Y\_{t}^{\dagger}\,\mathbb{1}\_{\left\{Y\_{t}^{\dagger} \leq \widehat{F}\_{Y|\boldsymbol{x}\_{t}^{\dagger}}^{-1}(\tau)\right\}} + \widehat{F}\_{Y|\boldsymbol{x}\_{t}^{\dagger}}^{-1}(\tau)\left(\tau - \mathbb{1}\_{\left\{Y\_{t}^{\dagger} \leq \widehat{F}\_{Y|\boldsymbol{x}\_{t}^{\dagger}}^{-1}(\tau)\right\}}\right)}{\tau},\tag{11.33}$$

and

$$\widehat{v}\_{+} = \frac{1}{T} \sum\_{t=1}^{T} \widehat{\operatorname{ES}}\_{\tau}^{+}(Y|\boldsymbol{x}\_{t}^{\dagger}) - \frac{Y\_{t}^{\dagger}\,\mathbb{1}\_{\left\{Y\_{t}^{\dagger} > \widehat{F}\_{Y|\boldsymbol{x}\_{t}^{\dagger}}^{-1}(\tau)\right\}} + \widehat{F}\_{Y|\boldsymbol{x}\_{t}^{\dagger}}^{-1}(\tau)\left(\mathbb{1}\_{\left\{Y\_{t}^{\dagger} \leq \widehat{F}\_{Y|\boldsymbol{x}\_{t}^{\dagger}}^{-1}(\tau)\right\}} - \tau\right)}{1 - \tau}.\tag{11.34}$$

These (empirical) identifications should be close to zero if the model fits the data.
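To see that the identification terms vanish under the true model, the following Python sketch (purely illustrative, ours, without any regression structure) evaluates (11.33)–(11.34) for unit exponential claims, where the $\tau$-quantile and both ES are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 0.9
y = rng.exponential(size=500_000)   # "out-of-sample" observations

# true tau-quantile and lower/upper ES of the unit exponential distribution
q = -np.log(1 - tau)
es_plus = q + 1.0                   # memoryless property: E[Y | Y > q] = q + 1
es_minus = (1 - (1 - tau) * (1 + q)) / tau

ind = (y <= q).astype(float)
v_minus = np.mean(es_minus - (y * ind + q * (tau - ind)) / tau)
v_plus = np.mean(es_plus - (y * (1 - ind) + q * (ind - tau)) / (1 - tau))
# both identification terms are close to zero under the true model
```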

Note that the latter terms in (11.33)–(11.34) describe the lower and upper ES also in the case of non-continuous distribution functions, because we have the identity

$$\mathrm{ES}\_{\tau}^{-}(Y|\boldsymbol{x}) = \frac{1}{\tau} \left( \mathbb{E}\left[ Y \mathbb{1}\_{\left\{ Y \leq F\_{Y|\boldsymbol{x}}^{-1}(\tau) \right\}} \,\middle|\, \boldsymbol{x} \right] + F\_{Y|\boldsymbol{x}}^{-1}(\tau) \left( \tau - F\_{Y|\boldsymbol{x}} \left( F\_{Y|\boldsymbol{x}}^{-1}(\tau) \right) \right) \right), \tag{11.35}$$

the second term being zero for a continuous distribution function $F\_{Y|\boldsymbol{x}}$, but needed for non-continuous distribution functions.

We compare the deep composite regression results of this section to the deep gamma and inverse Gaussian models using a double FN network for dispersion modeling, see Sect. 11.1.3. This requires calculating the ES in the gamma and the inverse Gaussian models, which can be done within the EDF, see Landsman–Valdez [233]. The upper ES in the gamma model $Y \sim \Gamma(\alpha, \beta)$ is given by, see (6.47),

$$\mathbb{E}\left[Y \,\middle|\, Y > F\_Y^{-1}(\tau)\right] = \frac{\alpha}{\beta} \left(\frac{1-\mathcal{G}\left(\alpha+1,\, \beta F\_Y^{-1}(\tau)\right)}{1-\tau}\right),$$

where $\mathcal{G}$ is the scaled incomplete gamma function (6.48) and $F\_Y^{-1}(\tau)$ is the $\tau$-quantile of $\Gamma(\alpha, \beta)$.
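This formula is straightforward to evaluate numerically. A Python/scipy sketch with a Monte Carlo check (our parameter choices are illustrative), where `scipy.special.gammainc` plays the role of the scaled incomplete gamma function $\mathcal{G}$:

```python
import numpy as np
from scipy import stats, special

alpha, beta, tau = 2.5, 0.5, 0.9   # illustrative shape, rate, quantile level
q = stats.gamma.ppf(tau, a=alpha, scale=1 / beta)   # tau-quantile of Gamma(alpha, beta)

# closed form: (alpha/beta) * (1 - G(alpha+1, beta*q)) / (1 - tau),
# with G the regularized (scaled) incomplete gamma function
es_plus = (alpha / beta) * (1 - special.gammainc(alpha + 1, beta * q)) / (1 - tau)

# Monte Carlo check of E[Y | Y > q]
y = stats.gamma.rvs(a=alpha, scale=1 / beta, size=1_000_000, random_state=2)
mc = y[y > q].mean()
```

The two values agree up to Monte Carlo error, and the upper ES lies above the quantile, as it must.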

Example 4.3 of Landsman–Valdez [233] gives the inverse Gaussian case (2.8) with $\alpha, \beta > 0$:

$$\begin{split} \mathbb{E}\left[Y\,\middle|\,Y>\,F\_{Y}^{-1}(\tau)\right] &= \frac{\alpha}{\beta}\left(1+\frac{1/\alpha}{1-\tau}\sqrt{F\_{Y}^{-1}(\tau)}\varphi(z\_{\tau}^{(1)})\right) \\ &+\frac{\alpha}{\beta}\frac{1/\alpha}{1-\tau}e^{2\alpha\beta}\left(2\alpha\Phi(-z\_{\tau}^{(2)})-\sqrt{F\_{Y}^{-1}(\tau)}\varphi(-z\_{\tau}^{(2)})\right), \end{split}$$

where $\varphi$ and $\Phi$ are the standard Gaussian density and distribution function, respectively, $F\_Y^{-1}(\tau)$ is the $\tau$-quantile of the inverse Gaussian distribution, and

$$z\_{\tau}^{(1)} = \frac{\alpha}{\sqrt{F\_Y^{-1}(\tau)}} \left( \frac{F\_Y^{-1}(\tau)}{\alpha/\beta} - 1 \right) \qquad \text{and} \qquad z\_{\tau}^{(2)} = \frac{\alpha}{\sqrt{F\_Y^{-1}(\tau)}} \left( \frac{F\_Y^{-1}(\tau)}{\alpha/\beta} + 1 \right).$$

This now allows us to calculate the identifications (11.33)–(11.34) in the fitted deep double networks using the gamma and the inverse Gaussian distributions of Sect. 11.1.3.

Table 11.8 shows the out-of-sample coverage ratios and the identifications of the deep composite regression and the two distributional approaches. These figures suggest that the gamma model is not competitive; the deep composite model has the most precise coverage ratio. In terms of the ES identification terms, the deep composite model and the double network with inverse Gaussian claim sizes are comparably accurate (out-of-sample) in determining the lower and upper 90% ES.

**Table 11.8** Out-of-sample coverage ratios $\widehat{\tau}$ and identifications $\widehat{v}\_{-}$ and $\widehat{v}\_{+}$ of the deep composite regression model and the deep double networks in the gamma and inverse Gaussian cases

Finally, we paste together the lower and upper ES from the deep composite regression model according to (11.25). This gives us an estimated mean (under a continuous distribution function)

$$
\widehat{\mu}(\boldsymbol{x}) = \widehat{\mathbb{E}}[Y|\boldsymbol{x}] = \tau\, \widehat{\operatorname{ES}}\_{\tau}^{-}(Y|\boldsymbol{x}) + (1 - \tau)\, \widehat{\operatorname{ES}}\_{\tau}^{+}(Y|\boldsymbol{x}).
$$

Figure 11.11 compares these estimates of the deep composite regression model to those of the deep double inverse Gaussian model. The black dots show 2'000 randomly selected claims $\boldsymbol{x}\_t^{\dagger}$, and the cyan line gives a spline fit to all out-of-sample claims in $\mathcal{T}$. The body of the estimates is rather similar in both approaches, but the deep composite approach provides more large estimates; the dotted orange lines show the maximum estimate from the deep double inverse Gaussian model.

We conclude that in the case where no member of the EDF reflects the properties of the data in the tail, the deep composite regression approach presented in this section provides an alternative method for mean estimation that allows for separate models in the main body and the tail of the data. Fixing the quantile level allows for straightforward fitting in one step; this is in contrast to composite models where the splicing point is fixed, which are more difficult to fit, e.g., requiring the EM algorithm.

## **11.4 Model Uncertainty: A Bootstrap Approach**

As described in Chap. 4, there are different sources of prediction uncertainty when forecasting random variables. There is the irreducible risk that comes from the fact that we try to predict random variables; this source of uncertainty is always present, even if we know the true data generating mechanism. In most applied situations we do not know the true data generating mechanism, which results in additional prediction uncertainty. Within GLMs this source of uncertainty has mainly been attributed to parameter estimation uncertainty, deriving from the fact that we estimate the parameters from a finite sample; we refer to Sects. 3.4 and 11.1.4 on asymptotic results. In network modeling, the situation is more complicated. Firstly, we have seen that there is no best network regression model, even if the architecture and the hyper-parameters are fully specified. In Fig. 7.18 we have seen that, in a claim frequency context, the different solutions from early stopped SGD fittings can have a coefficient of variation of up to 40% on the individual policy level; on average these coefficients of variation were around 10%. This has led to the consideration of network ensembling and the nagging predictor in Sect. 7.4.4. These considerations have been based on a fixed learning data set $\mathcal{L}$. In this section, we assume that also the learning data set $\mathcal{L}$ may look different by considering different realizations of the (randomly generated) observations $Y\_i$. To reflect this source of randomness in the outcomes, we bootstrap new data from $\mathcal{L}$ by exploring a non-parametric bootstrap with random draws with replacement from $\mathcal{L}$, see Sect. 4.3.1. This allows us to study the volatility implied in estimation by considering a different set of observations, i.e., a different sample.

Ideally we would like to generate new observations from the true data generating mechanism, but, since this mechanism is not known, we can at best generate data from an estimated model. If we rely on a distributional model, we may suffer from model error, e.g., in Sect. 11.3 we have seen that it is rather difficult to specify a distributional regression model that has the right tail behavior. Therefore, we may give preference to a distribution-free approach. Non-parametric bootstrapping is such a distribution-free approach, the disadvantage being that we cannot enrich the existing observations by new observations, but we can only rearrange the available observations.

We revisit the robust representation learning approach of Sect. 11.1.2 on the same Swiss accident insurance data as explored in that section. In particular, we reconsider the deep multi-output models introduced in (11.6) and studied in Table 11.3 for the power variance parameters $p = 2, 2.5, 3$ (and constant dispersion parameter). We perform exactly the same analysis, except that we now use bootstrapped data $\mathcal{L}^{\ast}$ for model fitting.

First, we fit the same deep FN network architecture as in (11.6) 100 times with different seeds (on identical learning data $\mathcal{L}$). From this we calculate the nagging predictor. Second, we generate 100 different bootstrap samples $\mathcal{L}^{\ast} = \mathcal{L}^{\ast(s)}$, $1 \leq s \leq 100$, from $\mathcal{L}$ (of identical sample size) by random draws with replacement, and we fit the same network architecture to these 100 bootstrap samples. We then also average over the 100 predictors obtained from the different bootstrap samples. Table 11.9 provides the resulting out-of-sample deviance losses on the test data $\mathcal{T}$. We always hold on to the same test data $\mathcal{T}$, which is disjoint from (and independent of) the learning data $\mathcal{L}$ and the bootstrap samples $\mathcal{L}^{\ast(s)}$, $1 \leq s \leq 100$.

**Table 11.9** Out-of-sample losses (gamma loss, power variance case $p = 2.5$ loss (in $10^{-2}$), and inverse Gaussian (IG) loss (in $10^{-3}$)) and average claim amounts; the losses use unit dispersion $\varphi = 1$

The nagging predictors over 100 seeds are roughly the same as over 20 seeds (see Table 11.3), which indicates that 20 different network fits suffice here. Interestingly, the averaged bootstrap version generally improves the nagging predictors. Thus, here the average bootstrap predictor provides a better balance among the observations, achieving superior predictive power on the test data $\mathcal{T}$; compare the lines 'nagging 100' vs. 'bootstrap 100' of Table 11.9.

The main purpose of this analysis is to understand the volatility involved in the nagging and bootstrap predictors. We therefore consider the coefficients of variation $\mathrm{Vco}\_t$ introduced in (7.43) on the individual policies $1 \le t \le T$. Figure 11.12 shows these coefficients of variation on the individual predictors, i.e., for the individual claims $\boldsymbol{x}\_t^{\dagger}$ and the individual network calibrations with different seeds. The left-hand side gives the coefficients of variation based on the 100 bootstrap samples, the right-hand side gives the coefficients of variation of the 100 predictors fitted on the same data $\mathcal{L}$ but with different seeds for the SGD algorithm; the $y$-scale is identical in both plots. We observe that the coefficients of variation are clearly higher under the bootstrap approach compared to holding on to the same data $\mathcal{L}$ for SGD fitting with different seeds. Thus, the nagging predictor averages over the randomness of different seeds for the network calibrations, whereas bootstrapping additionally considers possibly different samples $\mathcal{L}^{\ast}$ for model learning. We analyze the difference in magnitudes in more detail.

**Fig. 11.12** Coefficients of variation in the individual estimators: (lhs) bootstrap 100 and (rhs) nagging 100; the $y$-scale is identical in both plots

**Fig. 11.13** Coefficients of variation in the individual predictors of the bootstrap and the nagging approaches (ordered w.r.t. estimated claim sizes)

Figure 11.13 compares the two coefficients of variation for different claim sizes. The average coefficient of variation for fixed observations $\mathcal{L}$ is 15.9% (cyan columns); under bootstrapping it increases to 24.8% (orange columns). The blue line shows the average relative increase for the different claim sizes (right axis), and the blue dotted line is at a relative increase of 40%. From Fig. 11.13 we observe that this spread (relative increase) is rather constant across all claim predictions; we remark that 93.5% of all claim predictions are below 5'000, i.e., most claims are at the left end of Fig. 11.13.

From this small analysis we conclude that there is substantial model and estimation uncertainty involved; recall that we fit the deep network architecture to 305'550 individual claims with 7 feature components, which is a comparably large portfolio. On average, we have a coefficient of variation of 15% implied by SGD fitting with different seeds, and this coefficient of variation increases to roughly 25% when we additionally bootstrap the observations. This is considerable, and it requires that we ensemble these predictors to obtain more robust predictions. The results of Table 11.9 support this re-sampling and ensembling approach, as we receive a better out-of-sample performance.

## **11.5 LocalGLMnet: An Interpretable Network Architecture**

Network architectures are often criticized for not being (sufficiently) explainable. Of course, this is not fully true, as we have gained a lot of insight into the data examples studied in this book. This criticism of non-explainability has led to the development of the post-hoc model-agnostic tools studied in Sect. 7.6. This approach has been questioned in many places, and it is not clear whether one should try to explain black box models, or whether one should rather try to make the models interpretable in the first place, see, e.g., Rudin [322]. In this section we take the latter approach by working with a network architecture that is (more) interpretable. We present the LocalGLMnet proposal of Richman–Wüthrich [317, 318]. This approach allows for interpreting the results, and it allows for variable selection, either using an empirical Wald test or LASSO regularization.

There are several other proposals that try to achieve similar explainability with specific network architectures. There is the explainable neural network of Vaughan et al. [367] and the neural additive model of Agarwal et al. [3]. These proposals rely on parallel networks considering one single variable at a time. Of course, this limits their performance because of the missing interaction potential. This has been improved in the Combined Actuarial eXplainable Neural Network (CAXNN) approach of Richman [314], which requires a manual specification of parallel networks for potential interactions. The LocalGLMnet, presented in this section, does not require any manual engineering, and it still possesses the universal approximation property.

## *11.5.1 Definition of the LocalGLMnet*

The starting point of the LocalGLMnet is a classical GLM. Choose a strictly monotone and smooth link function $g$. A GLM is obtained by considering the regression function

$$\boldsymbol{x} \mapsto g(\mu(\boldsymbol{x})) = \beta\_0 + \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = \beta\_0 + \sum\_{j=1}^{q} \beta\_j x\_j,\tag{11.36}$$

for features $\boldsymbol{x} \in \mathcal{X} \subset \mathbb{R}^q$, intercept $\beta\_0 \in \mathbb{R}$ and regression parameter $\boldsymbol{\beta} \in \mathbb{R}^q$. Compared to (5.5), we change the notation in this section by excluding the intercept component from the feature $\boldsymbol{x} = (x\_1, \ldots, x\_q)^\top$, because this is more convenient for the LocalGLMnet proposal. The beauty of this GLM regression function is that we obtain a linear function after applying the link function $g$. This linear function is considered to be explainable, as we can precisely quantify how much the expected response changes if we slightly change one of the feature components $x\_j$. In particular, this holds true for the log-link, which leads to a multiplicative structure in the expected response.

The idea is to hold on to this additive structure (11.36) as far as possible, still trying to benefit from the universal approximation property of network architectures. Richman–Wüthrich [317] propose the following regression structure.

**Definition 11.12 (LocalGLMnet)** Choose a FN network architecture $\boldsymbol{z}^{(d:1)}: \mathbb{R}^q \to \mathbb{R}^q$ of depth $d \in \mathbb{N}$ with equal input and output dimensions to model the *regression attention*

$$\boldsymbol{\beta}: \mathbb{R}^q \to \mathbb{R}^q, \qquad \boldsymbol{x} \mapsto \boldsymbol{\beta}(\boldsymbol{x}) \stackrel{\text{def.}}{=} \boldsymbol{z}^{(d:1)}(\boldsymbol{x}) = \left( \boldsymbol{z}^{(d)} \circ \cdots \circ \boldsymbol{z}^{(1)} \right)(\boldsymbol{x}).$$

The *LocalGLMnet* is defined by the generalized *additive decomposition*

$$\boldsymbol{x} \mapsto g(\mu(\boldsymbol{x})) = \beta_0 + \langle \boldsymbol{\beta}(\boldsymbol{x}), \boldsymbol{x} \rangle = \beta_0 + \sum_{j=1}^{q} \beta_j(\boldsymbol{x}) x_j,$$

for a strictly monotone and smooth link function *g*.

This architecture is called LocalGLMnet because locally, around a given feature value $\boldsymbol{x}$, it can be understood as a GLM, provided that $\boldsymbol{\beta}(\boldsymbol{x})$ does not change too much in the environment of $\boldsymbol{x}$. In the GLM context $\boldsymbol{\beta}$ is called *regression parameter*, and in the LocalGLMnet context $\boldsymbol{\beta}(\boldsymbol{x})$ is called *regression attention* because the components $\beta_j(\boldsymbol{x})$ determine how much attention should be given to a specific value $x_j$. We highlight this in the following discussion. Select one component $1 \le j \le q$ and study the individual term

$$\boldsymbol{x} \mapsto \beta_j(\boldsymbol{x})\, x_j. \tag{11.37}$$


This term allows for the following interpretations: (1) if $\beta_j(\boldsymbol{x}) \equiv 0$, the term can be dropped; (2) if $\beta_j(\boldsymbol{x}) \equiv \beta_j$ is constant, we have a GLM term in $x_j$; (3) if $\beta_j(\boldsymbol{x}) = \beta_j(x_j)$ only depends on $x_j$, the term acts on $x_j$ without interactions; and (4) if $\beta_j(\boldsymbol{x})$ depends on components other than $x_j$, there are interactions. The latter can be explored through the gradient

$$\nabla_{\boldsymbol{x}} \beta_j(\boldsymbol{x}) = \left( \frac{\partial}{\partial x_1} \beta_j(\boldsymbol{x}), \ldots, \frac{\partial}{\partial x_q} \beta_j(\boldsymbol{x}) \right)^{\top} \in \mathbb{R}^q. \tag{11.38}$$

The $j$-th component of $\nabla_{\boldsymbol{x}} \beta_j(\boldsymbol{x})$ determines the (non-)linearity in the term $x_j$, while the components different from $j$ describe the interactions of the term $x_j$ with the other feature components.

(5) These interpretations need some care because we do not have identifiability. For the special regression attention $\beta_j(\boldsymbol{x}) = x_{j'}/x_j$, $j' \neq j$, we have

$$\beta_j(\boldsymbol{x})\, x_j = x_{j'}. \tag{11.39}$$

Therefore, we talk about *terms* in items (1)–(4); e.g., item (1) means that the term $\beta_j(\boldsymbol{x}) x_j$ can be dropped, however, the feature component $x_j$ may still play a significant role in some of the other regression attentions $\beta_{j'}(\boldsymbol{x})$, $j' \neq j$.

In practical applications we have not experienced the identifiability issue (11.39). Having the linear terms in the LocalGLMnet regression structure and starting the SGD fitting in the GLM already gives quite pre-determined regression functions; the LocalGLMnet is built around this initialization and hardly falls into a completely different model like (11.39).

(6) The LocalGLMnet architecture has the universal approximation property discussed in Sect. 7.2.2, because networks can approximate any continuous function arbitrarily well on a compact support for sufficiently large networks. We can then select one component, say, $x_1$, and let $\beta_1(\boldsymbol{x}) = z_1^{(d:1)}(\boldsymbol{x})$ approximate a given continuous function $f(\boldsymbol{x})/x_1$, i.e., $f(\boldsymbol{x}) \approx \beta_1(\boldsymbol{x}) x_1$ arbitrarily well on the compact support.
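The gradient reading of (11.38) can be sketched numerically. The attention function `beta_1` below is invented purely for illustration (it is not the output of a fitted LocalGLMnet); finite differences then recover the non-linearity in $x_1$ and the interaction with $x_2$:

```python
# Toy illustration of reading gradient (11.38); beta_1 is an invented
# attention function, not the output of a fitted network.

def beta_1(x):
    # hypothetical attention: non-linear in x1, interacting with x2
    x1, x2 = x
    return 0.3 * x1 ** 2 + 0.5 * x2

def grad(f, x, h=1e-6):
    """Central finite-difference gradient of f at x."""
    g = []
    for k in range(len(x)):
        xp, xm = list(x), list(x)
        xp[k] += h
        xm[k] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

g = grad(beta_1, [1.0, 2.0])
# g[0] != 0: the term beta_1(x) x_1 is non-linear in x_1;
# g[1] != 0: x_1 interacts with x_2
```

At $x = (1, 2)$ the first component is $0.6$ (non-linearity) and the second is $0.5$ (interaction), matching the discussion of (11.38).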

## *11.5.2 Variable Selection in LocalGLMnets*

The LocalGLMnet allows for variable selection through the regression attentions $\beta_j(\boldsymbol{x})$. Roughly speaking, if the estimated regression attentions $\widehat{\beta}_j(\boldsymbol{x}) \approx 0$, then the term $\beta_j(\boldsymbol{x}) x_j$ can be dropped. We can also explore whether the entire variable $x_j$ should be dropped (not only the corresponding term $\beta_j(\boldsymbol{x}) x_j$). For this, we have to refit the LocalGLMnet excluding the feature component $x_j$. If the out-of-sample performance on validation data does not change, then $x_j$ also does not play an important role in any other regression attention $\beta_{j'}(\boldsymbol{x})$, $j' \neq j$, and it can be completely dropped from the model.

In GLMs we can either use the Wald test or the LRT to test a null hypothesis $H_0: \beta_j = 0$, see Sect. 5.3. We explore a similar idea in this section, however, empirically. For this we first need to ensure that all feature components live on the same scale. We consider standardization with the empirical mean and the empirical standard deviation, see (7.30), and from now on we assume that all feature components are centered and have unit variance. Then, the main problem is to determine whether an estimated regression attention $\widehat{\beta}_j(\boldsymbol{x})$ is significantly different from 0 or not.

We therefore extend the features $\boldsymbol{x}^+ = (x_1, \ldots, x_q, x_{q+1})^\top \in \mathbb{R}^{q+1}$ by an additional independent and purely random component $x_{q+1}$ that is also standardized. Since this additional component is independent of all other components, it cannot have any predictive power for the response under consideration; thus, fitting this extended model should result in a regression attention $\widehat{\beta}_{q+1}(\boldsymbol{x}^+) \approx 0$. The estimate will not be exactly zero, because there is noise involved, and the magnitude of this fluctuation will determine the rejection/acceptance region of the null hypothesis of not being significant.

We fit the LocalGLMnet to the learning data $\mathcal{L}$ with features $\boldsymbol{x}_i^+ \in \mathbb{R}^{q+1}$ extended by the standardized i.i.d. component $x_{i,q+1}$ being independent of $(Y_i, \boldsymbol{x}_i)$. This gives us the estimated regression attentions $\widehat{\beta}_1(\boldsymbol{x}_i^+), \ldots, \widehat{\beta}_q(\boldsymbol{x}_i^+), \widehat{\beta}_{q+1}(\boldsymbol{x}_i^+)$. We compute the empirical mean and standard deviation of the attention weight of the additional component $x_{q+1}$

$$\bar{b}_{q+1} = \frac{1}{n} \sum_{i=1}^{n} \widehat{\beta}_{q+1}(\boldsymbol{x}_i^+) \qquad \text{and} \qquad \widehat{s}_{q+1} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( \widehat{\beta}_{q+1}(\boldsymbol{x}_i^+) - \bar{b}_{q+1} \right)^2}. \tag{11.40}$$

We expect approximate centering $\bar{b}_{q+1} \approx 0$ because this additional component $x_{q+1}$ does not enter the true regression function, and the empirical standard deviation $\widehat{s}_{q+1}$ quantifies the expected fluctuation around zero of insignificant components.

We can now test the null hypothesis $H_0: \beta_j(\boldsymbol{x}) = 0$ of component $j$ on significance level $\alpha \in (0, 1/2)$. We define the centered interval

$$I_\alpha = \left[ \Phi^{-1}(\alpha/2) \cdot \widehat{s}_{q+1},\ \Phi^{-1}(1 - \alpha/2) \cdot \widehat{s}_{q+1} \right], \tag{11.41}$$

where $\Phi^{-1}(p)$ denotes the standard Gaussian quantile for $p \in (0,1)$. $H_0$ should be rejected if the coverage ratio of this centered interval $I_\alpha$ is substantially smaller than $1 - \alpha$, i.e.,

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{\widehat{\beta}_j(\boldsymbol{x}_i^+) \in I_\alpha\}} < 1 - \alpha.$$

This proposal is designed for continuous feature components; categorical variables are discussed in Sect. 11.5.4, below. For $x_{q+1}$ we can choose a standard Gaussian distribution, a normalized uniform distribution, or we can randomly permute one of the feature components $x_{i,j}$ across the entire portfolio $1 \le i \le n$. Usually, the resulting empirical standard deviations $\widehat{s}_{q+1}$ are rather similar.
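The empty-variable test (11.40)–(11.41) can be sketched as follows. The simulated attentions stand in for fitted network outputs (real ones come from the fitted LocalGLMnet); all names are illustrative:

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)
n = 10_000
alpha = 0.001  # significance level 0.1%

# Stand-in for the fitted attentions beta_hat_{q+1}(x_i^+) of the purely
# random component: small fluctuations around zero.
att_noise = [random.gauss(0.0, 0.05) for _ in range(n)]

b_bar = statistics.fmean(att_noise)      # empirical mean, cf. (11.40)
s_hat = statistics.stdev(att_noise)      # empirical standard deviation

z = NormalDist().inv_cdf(1 - alpha / 2)  # approx. 3.29 for alpha = 0.1%
I = (-z * s_hat, z * s_hat)              # centered interval I_alpha, cf. (11.41)

def coverage(attentions):
    """Coverage ratio of I_alpha; small coverage means significance."""
    return sum(I[0] <= a <= I[1] for a in attentions) / len(attentions)

# A component whose attentions largely escape I_alpha is significant:
signal = [0.4 + random.gauss(0.0, 0.05) for _ in range(n)]
```

For `att_noise` the coverage ratio is close to $1-\alpha$, so the term is dropped; for `signal` it is close to zero, so $H_0$ is rejected.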

## *11.5.3 Lab: LocalGLMnet for Claim Frequency Modeling*

We revisit the French MTPL data example. We compare the LocalGLMnet approach to the deep FN network considered in Sect. 7.3.2, and we benchmark with the results of Table 7.3; we benchmark with the crudest FN network from above because, at the current stage, we need one-hot encoding for the LocalGLMnet approach. The analysis in this section is the same as in Richman–Wüthrich [317].

The French MTPL data has 6 continuous feature components (we treat Area as a continuous variable), 1 binary component and 2 categorical components. We pre-process the continuous and binary variables to zero mean and unit variance using standardization (7.30). This will allow us to do variable selection as presented in (11.41). The categorical variables with more than two levels are more difficult. In a first attempt we use one-hot encoding for the categorical variables. We prefer one-hot encoding over dummy coding because it ensures that for every level there is a component $x_j$ with $x_j \neq 0$. This is important because under dummy coding the term $\beta_j(\boldsymbol{x}) x_j$ is equal to zero for the reference level (since $x_j = 0$), which does not allow us to study interactions with other variables for the term corresponding to the reference level. Note that one-hot encoding and dummy coding do not lead to zero mean and unit variance.
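The difference between the two coding schemes can be sketched as follows (the helper names are ours, not the book's):

```python
# Sketch of one-hot versus dummy coding for a categorical variable.

def one_hot(level, levels):
    # one indicator per level: every level has a component x_j != 0
    return [1.0 if level == l else 0.0 for l in levels]

def dummy(level, levels):
    # the first level acts as reference level and maps to the zero vector,
    # so its term beta_j(x) x_j vanishes and interactions cannot be studied
    return [1.0 if level == l else 0.0 for l in levels[1:]]

brands = ["B1", "B2", "B3"]
one_hot("B1", brands)   # [1.0, 0.0, 0.0]
dummy("B1", brands)     # [0.0, 0.0] -- reference level drops out
```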

This feature pre-processing gives us a feature vector $\boldsymbol{x} \in \mathbb{R}^q$ of dimension $q = 40$. For variable selection of the continuous and binary components we extend the feature $\boldsymbol{x}$ by two additional independent components $x_{q+1}$ and $x_{q+2}$. We select two components to explore whether the particular distributional choice has some influence on the choice of the acceptance/rejection interval $I_\alpha$ in (11.41). We choose for policies $1 \le i \le n$

$$x_{i,q+1} \stackrel{\text{i.i.d.}}{\sim} \text{Uniform}\left[-\sqrt{3}, \sqrt{3}\right] \qquad \text{and} \qquad x_{i,q+2} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,1),$$

these two sets of variables being mutually independent, and being independent from all other variables. We define the extended features $\boldsymbol{x}_i^+ = (x_{i,1}, \ldots, x_{i,q}, x_{i,q+1}, x_{i,q+2})^\top \in \mathbb{R}^{q_0}$ with $q_0 = q + 2$, and we consider the LocalGLMnet regression function

$$\boldsymbol{x}^+ \mapsto \log\left(\mu(\boldsymbol{x}^+)\right) = \beta_0 + \sum_{j=1}^{q_0} \beta_j(\boldsymbol{x}^+)\, x_j.$$

We choose the log-link for Poisson claim frequency modeling. The time exposure *v >* 0 can either be integrated as a weight to the EDF or as an offset on the canonical scale resulting in the same Poisson model, see Sect. 5.2.3.

**Listing 11.7** LocalGLMnet architecture

```
1 Design = layer_input(shape = c(42), dtype = 'float32', name = 'Design')
2 Vol = layer_input(shape = c(1), dtype = 'float32', name = 'Vol')
3 #
4 Attention = Design %>%
5 layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
6 layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
7 layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
8 layer_dense(units=42, activation='linear', name='Attention')
9 #
10 LocalGLM = list(Design, Attention) %>% layer_dot(name='LocalGLM', axes=1) %>%
11 layer_dense(units=1, activation='exponential', name='Balance')
12 #
13 Response = list(LocalGLM, Vol) %>% layer_multiply(name='Multiply')
14 #
15 keras_model(inputs = c(Design, Vol), outputs = c(Response))
```
We are now ready to define the LocalGLMnet architecture. We choose a network $\boldsymbol{z}^{(d:1)}: \mathbb{R}^{q_0} \to \mathbb{R}^{q_0}$ of depth $d = 4$ with $(q_1, q_2, q_3, q_4) = (20, 15, 10, 42)$ neurons. The R code is given in Listing 11.7. We note that this is not much more involved than a plain-vanilla FN network. Slightly special in this implementation is the integration of the intercept $\beta_0$ on line 11. Naturally, we would like to add this intercept directly; however, there is no simple code for doing this. For that reason, we model the additive decomposition by

$$\boldsymbol{x}^+ \mapsto \log\left(\mu(\boldsymbol{x}^+)\right) = \alpha_0 + \alpha_1 \sum_{j=1}^{q_0} \beta_j(\boldsymbol{x}^+)\, x_j,$$

with real-valued parameters $\alpha_0$ and $\alpha_1$ being estimated on line 11 of Listing 11.7. Thus, in this implementation the regression attentions are obtained by $\alpha_1 \beta_j(\boldsymbol{x}^+)$. Of course, there are also other ways of implementing this. This LocalGLMnet architecture has 1'799 network weights to be fitted.

We fit this LocalGLMnet using a training to validation data split of 8:2 and a batch size of 5'000. We initialize the gradient descent algorithm such that we exactly start in the GLM with $\beta_j(\boldsymbol{x}^+) \equiv \widehat{\beta}_j^{\,\text{MLE}}$. For this we set all weights in the last layer on line 8 of Listing 11.7 to zero, $w_{l,j}^{(d)} = 0$, and the corresponding intercepts to the MLEs of the GLM, i.e., $w_{0,j}^{(d)} = \widehat{\beta}_j^{\,\text{MLE}}$. This gives us the GLM initialization $\sum_{j=1}^{q_0} \widehat{\beta}_j^{\,\text{MLE}} x_j$ on line 10 of Listing 11.7. Moreover, on line 11 of that listing, we initialize $\alpha_1 = 1$ and $\alpha_0 = \widehat{\beta}_0^{\,\text{MLE}}$. This implies that the gradient descent algorithm starts in the MLE estimated GLM. The SGD fitting turns out to be faster than in the plain-vanilla FN case, probably because we start in the GLM, having already the reasonable linear terms $x_j$ in the model, and we only need to find the regression attentions $\beta_j(\boldsymbol{x}^+)$ around these linear terms. The results are presented on the second last line of Table 11.10. The out-of-sample results are slightly worse than in the plain-vanilla FN case. There are many reasons for that; for instance, many levels in one-hot encoding may lead to more potential for over-fitting, and hence to an earlier


**Table 11.10** Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10−2) and in-sample average frequency of the Poisson regressions, see also Table 7.3

stopping here. The same applies if we add too many purely random components $x_{q+l}$, $l \ge 1$. Since the balance property will not hold, in general, we apply the bias regularization step (7.33) to adjust $\alpha_0$ and $\alpha_1$; the results are presented on the last line of Table 11.10. In Remark 3.1 of Richman–Wüthrich [317] a more sophisticated balance property correction is presented. Our goal now is to analyze this solution.
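The GLM initialization of the attention output layer can be mimicked in a few lines of plain Python (a sketch only; the actual model sets these weights in the Keras layers of Listing 11.7, and the coefficient values below are hypothetical):

```python
# Minimal sketch of the GLM initialization: zero last-layer weights and
# intercepts equal to the GLM MLEs make the attention constant in the input.

def dense(x, W, b):
    """One dense linear layer: returns the vector W^T x + b."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) + b[j]
            for j in range(len(b))]

beta_mle = [0.1, -0.3, 0.2]            # hypothetical GLM coefficients
q = len(beta_mle)
W0 = [[0.0] * q for _ in range(q)]     # last-layer weights set to zero
b0 = beta_mle                          # last-layer intercepts = MLEs

z = [0.5, -1.2, 2.0]                   # any input from the previous layer
attention = dense(z, W0, b0)           # equals beta_mle, independent of z

x = [1.0, 2.0, 3.0]
lin_pred = sum(a * xj for a, xj in zip(attention, x))
# lin_pred = sum_j beta_j^MLE x_j, i.e., SGD starts in the fitted GLM
```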

**Listing 11.8** Extracting the regression attentions from the LocalGLMnet architecture


We start by analyzing the two additional components $x_{i,q+1}$ and $x_{i,q+2}$, being uniformly and Gaussian distributed, respectively. Listing 11.8 shows how to extract the estimated regression attentions $\widehat{\boldsymbol{\beta}}(\boldsymbol{x}_i^+)$. We calculate the means and standard deviations of the estimated regression attentions of the two additional components

$$\bar{b}_{q+1} = 0.0042 \qquad \text{and} \qquad \bar{b}_{q+2} = 0.0213,$$

and

$$
\widehat{s}\_{q+1} = 0.0516 \qquad \text{and} \qquad \widehat{s}\_{q+2} = 0.0482.
$$

From these numbers we see that the regression attentions $\widehat{\beta}_{q+2}(\boldsymbol{x}_i^+)$ are slightly biased, whereas the $\widehat{\beta}_{q+1}(\boldsymbol{x}_i^+)$ are fairly centered compared to the magnitudes of the standard deviations. If we select a significance level of $\alpha = 0.1\%$, we receive a two-sided standard normal quantile of $|\Phi^{-1}(\alpha/2)| = 3.29$. This provides us for interval (11.41) with

$$I_\alpha = \left[ \Phi^{-1}(\alpha/2) \cdot \widehat{s}_{q+1},\ \Phi^{-1}(1 - \alpha/2) \cdot \widehat{s}_{q+1} \right] = [-0.17, 0.17].$$

**Fig. 11.14** Estimated regression attentions $\widehat{\beta}_j(\boldsymbol{x}_i^+)$ of the continuous and binary feature components Area, BonusMalus, log-Density, DrivAge, VehAge, VehGas, VehPower and the two random features $x_{i,q+1}$ and $x_{i,q+2}$ of 2'000 randomly selected policies $\boldsymbol{x}_i^+$; the orange area shows the interval $I_\alpha$ for dropping term $\beta_j(\boldsymbol{x}) x_j$ on significance level $\alpha = 0.1\%$

Figure 11.14 shows the estimated regression attentions $\widehat{\beta}_j(\boldsymbol{x}_i^+)$ of the continuous and binary feature components for 2'000 randomly selected policies $\boldsymbol{x}_i^+$, and the orange area shows the acceptance region $I_\alpha$ on significance level $\alpha = 0.1\%$. Focusing on the plots of the two additional variables $x_{i,q+1}$ and $x_{i,q+2}$, Fig. 11.14 (bottom, middle and right), we observe that the estimated regression attentions lie mostly within the confidence bounds of $I_\alpha$. This says that we should drop these two terms (of course, this is clear since we have set the bounds according to these regression attentions). Focusing on the other variables, we question the inclusion of the term VehPower as it seems concentrated within $I_\alpha$, and hence we cannot reject the null hypothesis $H_0: \beta_{\text{VehPower}}(\boldsymbol{x}) = 0$. Moreover, the inclusion of the term Area needs further exploration.


**Table 11.11** Run times, number of parameters, in-sample and out-of-sample deviance losses (units are in 10−2) and in-sample average frequency of the Poisson regressions, see also Table 7.3

We remind the reader that dropping a term $\beta_j(\boldsymbol{x}) x_j$ does not necessarily imply that we have to completely drop $x_j$, because it may still play an important role in one of the other regression attentions $\beta_{j'}(\boldsymbol{x})$, $j' \neq j$. Therefore, we re-run the whole fitting procedure, but we drop the purely random feature components $x_{i,q+1}$ and $x_{i,q+2}$, and we also drop VehPower and Area to see whether we receive a model with a similar predictive power. This would then imply that we can drop these variables, in the sense of variable selection similar to the LRT and the Wald test of Sect. 5.3. We denote the feature where we drop these components by $\boldsymbol{x}^- \in \mathbb{R}^{q-2}$.

We re-fit the LocalGLMnet on the reduced features $\boldsymbol{x}_i^-$, and the results are presented in Table 11.11. We observe that the loss figures decrease. Indeed, this supports the null hypothesis of dropping VehPower and Area. The reason for being able to drop VehPower is that it does not contribute (sufficiently) to explaining the systematic effects in the responses. The reason for being able to drop Area is slightly different: we have seen that Area and log-Density are highly correlated, see Fig. 13.12 (rhs), and it turns out that it is sufficient to only keep the Density variable (on the log-scale) in the model.

In a next step, we should analyze the robustness of these results by exploring the nagging predictor and/or bootstrapping as described in Sect. 11.4. We refrain from doing so, but we illustrate the LocalGLMnet solution of Table 11.11 in more detail. Figure 11.15 shows the feature contributions $\widehat{\beta}_j(\boldsymbol{x}_i^-) x_{i,j}$ of 2'000 randomly selected policies for the significant continuous and binary feature components. The magenta line gives a spline fit, and the more the black dots spread around these splines, the more interactions we have; for instance, higher bonus-malus levels interact with the age of the driver, which explains the scattering of the black dots. On average, frequencies are increasing in bonus-malus levels and density, decreasing in vehicle age, and for the driver's age variable it is important to understand the interactions. We observe that the spline fit for the log-Density is close to a linear function; this reflects that the regression attentions $\widehat{\beta}_{\text{Density}}(\boldsymbol{x}_i)$ in Fig. 11.14 (top-right) are more or less constant. This is also confirmed by the marginal plot in Fig. 5.4 (bottom-rhs) which has motivated the choice of a linear term for the log-Density in model Poisson GLM1 of Table 5.3.

**Fig. 11.15** Feature contributions $\widehat{\beta}_j(\boldsymbol{x}_i^-) x_{i,j}$ of the significant continuous and binary variables

Using the regression attentions we define an importance measure. We consider the extended features *x*+ in the following numerical analysis. We set

$$\text{IM}_j = \frac{1}{n} \sum_{i=1}^{n} \left| \widehat{\beta}_j(\boldsymbol{x}_i^+) \right|,$$

for $1 \le j \le q+2$, where we aggregate over all policies $1 \le i \le n$. Figure 11.16 shows the importance measures $\text{IM}_j$ of the continuous and binary variables $j$. The bars are ordered w.r.t. these importance measures. The graph confirms our previous conclusion: the least important variables are the two additional purely

random components $x_{i,q+1}$ and $x_{i,q+2}$, followed by Area and VehPower. These are exactly the components that have been dropped going from the full model $\boldsymbol{x}^+$ to the reduced model $\boldsymbol{x}^-$.
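The importance measure $\text{IM}_j$ is straightforward to compute once the attentions are extracted. A sketch on made-up attention values (real ones come from the fitted LocalGLMnet; the variable names are only for illustration):

```python
# Importance measure IM_j = (1/n) sum_i |beta_hat_j(x_i^+)| on toy values.

attentions = {
    "BonusMalus": [0.35, -0.41, 0.30],
    "VehPower":   [0.02, -0.01, 0.03],
    "RandomU":    [0.01,  0.00, -0.02],   # purely random component
}

def importance(values):
    """Mean absolute regression attention of one component."""
    return sum(abs(v) for v in values) / len(values)

IM = {j: importance(v) for j, v in attentions.items()}
ranking = sorted(IM, key=IM.get, reverse=True)
# ranking starts with the important terms; the random component comes last
```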

Next, we analyze the interactions by studying the gradients (11.38). Figure 11.17 illustrates spline fits to the components $\partial \widehat{\beta}_j(\boldsymbol{x}_i^-)/\partial x_k$ w.r.t. $x_j$ of the continuous variables BonusMalus, log-Density, DrivAge and VehAge over all policies $i = 1, \ldots, n$. The components $\partial \widehat{\beta}_j(\boldsymbol{x}_i^-)/\partial x_j$ show the non-linearity in $x_j$. We conclude that BonusMalus, DrivAge and VehAge should be non-linear, whereas log-Density is linear because $\partial \widehat{\beta}_j(\boldsymbol{x}_i^-)/\partial x_j \approx 0$. The components $\partial \widehat{\beta}_j(\boldsymbol{x}_i^-)/\partial x_k$, $k \neq j$, determine the interactions. We have the strongest interactions between BonusMalus and DrivAge, and BonusMalus has interactions with all variables. On the other hand, the log-Density only interacts with BonusMalus.

The reader will have noticed that we have excluded the categorical components VehBrand and Region from all model discussions. Firstly, these components are not standardized to zero mean and unit variance, and, secondly, we cannot study one level in isolation to decide whether to keep or drop that variable. That is, similar to group LASSO, we need to study all levels of each categorical feature component simultaneously. We do this in the next section, and we conclude with the regression

**Fig. 11.17** Spline fits to the derivatives $\partial \widehat{\beta}_j(\boldsymbol{x}_i^-)/\partial x_k$ w.r.t. $x_j$ of the continuous variables BonusMalus, log-Density, DrivAge and VehAge over all policies $i = 1, \ldots, n$

attentions $\widehat{\beta}_j(\boldsymbol{x})$ of the categorical feature components in Fig. 11.18, which seem to be significantly different from zero (VehBrands B10, B11, and Regions R22, R43, R82, R93), but which do not allow for variable selection as just described.

*Remark 11.13* The bias regularization in Table 11.11 has simply been obtained by applying an additional MLE step to $\alpha_0$ and $\alpha_1$. Alternatively, we can also define the new features $\boldsymbol{z}_i = (\widehat{\alpha}_1 \widehat{\beta}_1(\boldsymbol{x}_i) x_{i,1}, \ldots, \widehat{\alpha}_1 \widehat{\beta}_{q_0}(\boldsymbol{x}_i) x_{i,q_0})^\top \in \mathbb{R}^{q_0}$, and then apply a proper GLM step to these newly learned features $\boldsymbol{z}_1, \ldots, \boldsymbol{z}_n$. Working with the canonical link will give us the balance property. This is discussed in more detail in Remark 3.1 of Richman–Wüthrich [317].


**Fig. 11.18** Boxplot of the regression attentions $\widehat{\beta}_j(\boldsymbol{x})$ of the categorical feature components VehBrand and Region; the $y$-scale is the same as in Fig. 11.15

## *11.5.4 Variable Selection Through Regularization of the LocalGLMnet*

A natural next step is to introduce regularization on the regression attentions $\boldsymbol{\beta}(\boldsymbol{x})$; this is the proposal of Richman–Wüthrich [318]. We choose the LocalGLMnet architecture $\boldsymbol{x} \mapsto \mu(\boldsymbol{x})$ of Definition 11.12, having an intercept parameter $\beta_0 \in \mathbb{R}$ and network weights $\boldsymbol{w}$. For fitting, we consider a loss function $L$ and we add a regularization term to this loss function penalizing large regression attentions. That is, we aim at minimizing

$$\underset{\beta_0, \boldsymbol{w}}{\arg\min}\ \frac{1}{n} \sum_{i=1}^{n} \Big( L\left(Y_i, \mu(\boldsymbol{x}_i)\right) + \mathcal{R}(\boldsymbol{\beta}(\boldsymbol{x}_i)) \Big), \tag{11.42}$$

with a penalty term (regularizer) $\mathcal{R}(\cdot) \ge 0$. For the penalty term $\mathcal{R}$ we can choose different forms; e.g., the elastic net regularizer of Zou–Hastie [409] is obtained by, see Remark 6.3,

$$\underset{\beta_0, \boldsymbol{w}}{\arg\min}\ \frac{1}{n} \sum_{i=1}^{n} \Big( L\left(Y_i, \mu(\boldsymbol{x}_i)\right) + \eta \left( (1-\alpha) \|\boldsymbol{\beta}(\boldsymbol{x}_i)\|_2^2 + \alpha \|\boldsymbol{\beta}(\boldsymbol{x}_i)\|_1 \right) \Big), \tag{11.43}$$

for a regularization parameter *η* ≥ 0 and weight *α* ∈ [0*,* 1]. For *α* = 0 we receive ridge regularization, and for *α* = 1 we get LASSO regularization of *β(*·*)*.

For variable selection of categorical feature components we should rather use the group LASSO penalization of Yuan–Lin [398], see also (6.5). Assume the features $\boldsymbol{x}$ have a natural group structure $\boldsymbol{x} = (\boldsymbol{x}_1^\top, \ldots, \boldsymbol{x}_K^\top)^\top \in \mathbb{R}^q$. We consider the optimization

$$\underset{\beta_0, \boldsymbol{w}}{\arg\min}\ \frac{1}{n} \sum_{i=1}^{n} \Big( L\left(Y_i, \mu(\boldsymbol{x}_i)\right) + \sum_{k=1}^{K} \eta_k \|\boldsymbol{\beta}_k(\boldsymbol{x}_i)\|_2 \Big), \tag{11.44}$$

for regularization parameters $\eta_k \ge 0$, and where $\boldsymbol{\beta}_k(\boldsymbol{x})$ collects all components $\beta_j(\boldsymbol{x})$ of $\boldsymbol{\beta}(\boldsymbol{x})$ that belong to the $k$-th group $\boldsymbol{x}_k$ of $\boldsymbol{x}$. Yuan–Lin [398] propose to scale the regularization parameters as $\eta_k = \sqrt{q_k}\, \eta \ge 0$, where $q_k$ is the size of group $k$. Note that if every group has size one we exactly obtain LASSO regularization.

Solving the optimization problem (11.44) poses some challenges because the regularizer is not differentiable in zero. In Sect. 6.2.5 we have presented the generalized projection operator (using the soft-thresholding operator) to solve the group LASSO regularization within GLMs. However, this proposal will not work here: the generalized projection operator may help to project the regression attentions $\boldsymbol{\beta}(\boldsymbol{x}_i)$ back to the constraint set $C$, but it does not tell us anything about how to choose the network parameters $\boldsymbol{w}$. In a different setting, Oelker–Tutz [288] propose to use a differentiable approximation to the terms in (11.44). Choose $\epsilon > 0$ and define for $\boldsymbol{\beta}_k \in \mathbb{R}^{q_k}$

$$\|\boldsymbol{\beta}_k\|_{2,\epsilon} = \sqrt{\|\boldsymbol{\beta}_k\|_2^2 + \epsilon} = \sqrt{\boldsymbol{\beta}_k^\top \boldsymbol{\beta}_k + \epsilon} \quad \to \quad \|\boldsymbol{\beta}_k\|_2 \quad \text{as } \epsilon \downarrow 0. \tag{11.45}$$

This motivates to study the optimization problem for a fixed (small) $\epsilon > 0$

$$\underset{\beta_0, \boldsymbol{w}}{\arg\min}\ \frac{1}{n} \sum_{i=1}^{n} \Big( L\left(Y_i, \mu(\boldsymbol{x}_i)\right) + \sum_{k=1}^{K} \eta_k \|\boldsymbol{\beta}_k(\boldsymbol{x}_i)\|_{2,\epsilon} \Big). \tag{11.46}$$

In Fig. 11.19 we plot these $\epsilon$-approximations for $\epsilon \in \{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$. The plot on the left-hand side gives $\beta \in \mathbb{R} \mapsto \|\beta\|_{2,\epsilon} = \sqrt{\beta^2 + \epsilon} \to |\beta|$ for $\epsilon \downarrow 0$, and the plot on the right-hand side gives the unit ball

$$\mathcal{B}_\epsilon = \left\{ \boldsymbol{\beta} = (\beta_1, \beta_2)^\top \in \mathbb{R}^2 :\ \|\beta_1\|_{2,\epsilon} + \|\beta_2\|_{2,\epsilon} = 1 \right\}.$$

For the last two choices of $\epsilon$ there is no visible difference to the 1-norm.

**Fig. 11.19** (lhs) Comparison of $|\beta|$ and $\|\beta\|_{2,\epsilon} = \sqrt{\beta^2 + \epsilon}$ for $\beta \in \mathbb{R}$, and (rhs) unit balls $\mathcal{B}_\epsilon$ for $\epsilon \in \{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$ compared to the Manhattan unit ball

The main disadvantage of the $\epsilon$-approximation is that it does not shrink unimportant components $\beta_j(\boldsymbol{x})$ exactly to zero. But it allows us to identify unimportant (small) components, which can then be removed manually. As mentioned in Lee et al. [237], LASSO regularization needs a second model calibration step that fits the model only on the selected components (and without regularization) to receive an optimal predictive power and a minimal bias. Thus, we need a second calibration step after the removal of the unimportant components anyway.
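The $\epsilon$-approximation (11.45) and the Yuan–Lin scaling $\eta_k = \sqrt{q_k}\,\eta$ can be sketched numerically (the group sizes and $\eta$ below match this section's example, the rest is illustrative):

```python
import math

def norm2_eps(beta, eps):
    # smooth approximation (11.45) of the Euclidean norm ||beta||_2
    return math.sqrt(sum(b * b for b in beta) + eps)

beta_k = [0.3, -0.4]                        # ||beta_k||_2 = 0.5
approx = [norm2_eps(beta_k, 10.0 ** (-e)) for e in range(1, 6)]
# the approximations decrease towards 0.5 as eps -> 0, but never reach it,
# which is why small attentions are not shrunk exactly to zero

# group LASSO scaling eta_k = sqrt(q_k) * eta of Yuan-Lin:
eta = 0.0025
eta_k = [eta * math.sqrt(q) for q in [1, 10, 21]]   # group sizes of Sect. 11.5.5
```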

## *11.5.5 Lab: LASSO Regularization of LocalGLMnet*

We revisit the LocalGLMnet architecture applied to the French MTPL claim frequency data, see Sect. 11.5.3. The goal is to perform a group LASSO regularization so that we can also study the importance of the terms coming from the categorical feature components VehBrand and Region. We first pre-process all feature components as follows. We apply dummy coding to the categorical variables, and then we standardize all components to zero mean and unit variance; this includes the dummy coded components.

In a next step we need to define the natural groups $\boldsymbol{x} = (\boldsymbol{x}_1^\top, \ldots, \boldsymbol{x}_K^\top)^\top \in \mathbb{R}^q$. We have 7 continuous and binary components, which give us dimensions $q_k = 1$ for $1 \le k \le 7$. VehBrand provides us with a group of size $q_8 = 10$, and Region gives us a group of size $q_9 = 21$. We set $K = 9$ and $q = \sum_{k=1}^{9} q_k = 38$. We code

**Listing 11.9** Group LASSO regularization design

```
1 group.lasso.grouping <- function(xx){
2 pp <- array(0, dim=c(length(xx),sum(xx)))
3 for (k in 1:length(xx)){
4 if (k==1){pp[k,1:xx[k]] <- 1
5 }else{
6 pp[k,(sum(xx[1:(k-1)])+1):sum(xx[1:k])] <- 1
7 }}
8 t(pp)
9 }
10 #
11 ww <- group.lasso.grouping(c(rep(1,7),10,21))
12 etaK <- eta * sqrt(c(rep(1,7),10,21))
```
a (sort of) regularization design matrix to encode the $K$ groups and weights $\sqrt{q_k}$ for the $q$ components of $\boldsymbol{x}$. This is done in Listing 11.9, providing us with a matrix of size $38 \times 9$ and the weights $\sqrt{q_k}$. This regularization design matrix enters the penalty term on lines 13 and 16 of Listing 11.10, which weights the penalizations $\|\cdot\|_{2,\epsilon}$.
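A Python mirror of the grouping construction of Listing 11.9 (a hedged sketch with our own helper name) makes its structure explicit — each of the 38 components is assigned to exactly one of the $K = 9$ groups:

```python
# Builds the 38 x 9 regularization design matrix that maps the 38 (squared)
# attention components onto their K = 9 group sums, mirroring Listing 11.9.

def grouping_matrix(group_sizes):
    q, K = sum(group_sizes), len(group_sizes)
    M = [[0.0] * K for _ in range(q)]
    row = 0
    for k, size in enumerate(group_sizes):
        for _ in range(size):
            M[row][k] = 1.0     # component `row` belongs to group k
            row += 1
    return M

sizes = [1] * 7 + [10, 21]      # 7 singletons, VehBrand, Region
ww = grouping_matrix(sizes)     # 38 rows, 9 columns
# column k sums the squared attentions of group k; these group sums are
# then weighted by eta_k = sqrt(q_k) * eta as on line 12 of the listing
```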

**Listing 11.10** LocalGLMnet with group LASSO regularization

```
1 Design = layer_input(shape = c(38), dtype = 'float32')
2 LogVol = layer_input(shape = c(1), dtype = 'float32')
3 Bias1 = layer_input(shape = c(1), dtype = 'float32')
4 #
5 Attention = Design %>%
6 layer_dense(units=15, activation='tanh') %>%
7 layer_dense(units=10, activation='tanh') %>%
8 layer_dense(units=38, activation='linear', name='Attention')
9 #
10 Penalty = Attention %>%
11 layer_lambda(function(x) k_square(x)) %>%
12 layer_dense(units=9, activation='linear',
13 weights=list(ww), use_bias=FALSE, trainable=FALSE) %>%
14 layer_lambda(function(x) k_sqrt(x+epsilon)) %>%
15 layer_dense(units=1, activation='linear',
16 weights=list(array(etaK, dim=c(9,1))), use_bias=FALSE, trainable=FALSE)
17 #
18 LocalGLM = list(Design, Attention) %>% layer_dot(axes=1)
19 #
20 Bias = Bias1 %>%
21 layer_dense(units=1, activation='linear', use_bias=FALSE)
22 #
23 Response = list(LocalGLM, Bias, LogVol) %>% layer_add() %>%
24 layer_lambda(function(x) k_exp(x))
25 #
26 Output = list(Response, Penalty) %>% layer_concatenate()
27 #
28 keras_model(inputs = c(Design, LogVol, Bias1), outputs = c(Output))
```
The entire group LASSO regularized LocalGLMnet is depicted in Listing 11.10, showing the regression attentions on lines 5–8 and the regularization on lines 10–16; the output on line 26 returns the expected response $v_i \mu(\boldsymbol{x}_i)$ and the regularizer $\sum_{k=1}^{K} \eta_k \|\boldsymbol{\beta}_k(\boldsymbol{x}_i)\|_{2,\epsilon}$. We choose $\epsilon = 10^{-5}$ for our example.

**Listing 11.11** Group LASSO regularized Poisson deviance loss


Finally, we need to code the loss function (11.42). This is done in Listing 11.11. We combine the Poisson deviance loss function with the group LASSO $\epsilon$-approximation $\sum_{k=1}^{K} \eta_k \|\boldsymbol{\beta}_k(\boldsymbol{x}_i)\|_{2,\epsilon}$, the latter being output by Listing 11.10. We fit this network to the French MTPL data (as above) for regularization parameters $\eta \in \{0, 0.0025, 0.005\}$. Firstly, we note that the resulting networks are not fully competitive; this is probably due to the fact that the high-dimensional dummy coding leads to too much over-fitting potential, which leads to a very early stopping in gradient descent fitting. Thus, this approach may not be useful to directly receive a good predictive model, but it may be helpful to select the right feature components to design a good predictive model.
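The structure of this regularized objective can be sketched in plain Python (not the Keras implementation of Listing 11.11; the penalty values below are hypothetical stand-ins for the network output $\sum_k \eta_k \|\boldsymbol{\beta}_k(\boldsymbol{x}_i)\|_{2,\epsilon}$):

```python
import math

def poisson_deviance(y, mu):
    # unit Poisson deviance 2*(mu - y + y*log(y/mu)), with the convention
    # y*log(y/mu) = 0 for y = 0
    return 2.0 * (mu - y + (y * math.log(y / mu) if y > 0 else 0.0))

def regularized_loss(ys, mus, penalties):
    # objective (11.42): average of deviance plus per-policy group LASSO
    # penalty, the latter supplied per observation as in Listing 11.10
    n = len(ys)
    return sum(poisson_deviance(y, m) + p
               for y, m, p in zip(ys, mus, penalties)) / n

loss = regularized_loss([0, 1, 2], [0.1, 0.9, 1.5], [0.01, 0.02, 0.015])
```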

Figure 11.20 gives the importance measures of the estimated regression attentions

$$\mathrm{IM}_j = \frac{1}{n} \sum_{i=1}^{n} \left| \widehat{\beta}_j(\boldsymbol{x}_i) \right|,$$

of all components 1 ≤ *j* ≤ *q* = 38. The red color corresponds to regularization parameter *η* = 0.005, red + yellow colors to *η* = 0.0025, and red + yellow + green colors to *η* = 0 (no regularization). Figure 11.20 (lhs) shows the results on the original (standardized) features *x*. By far the smallest red + yellow column among the continuous features is observed for VehPower, which confirms the variable selection of Sect. 11.5.3. Among the categorical variables, Region seems more important (on average) than VehBrand because the red and yellow columns are generally bigger for Region. All these red and yellow columns of VehBrand and Region are bigger than the ones of VehPower, which supports the inclusion of the two categorical variables.
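The importance measure is simply the average absolute regression attention per component; a minimal Python sketch (the attention values below are illustrative, not fitted):

```python
import numpy as np

def importance_measure(beta_hat):
    """Importance measure IM_j = (1/n) * sum_i |beta_hat_j(x_i)|.

    beta_hat: (n, q) array of fitted regression attentions,
    one row per instance, one column per feature component.
    Returns a length-q vector of importances."""
    return np.mean(np.abs(beta_hat), axis=0)

# illustrative attentions for n=4 instances and q=3 components
beta_hat = np.array([[ 0.2, -0.5, 0.01],
                     [-0.3,  0.4, 0.02],
                     [ 0.1, -0.6, 0.00],
                     [-0.2,  0.5, 0.01]])
im = importance_measure(beta_hat)
# the third component has a near-zero importance and is a
# candidate for removal
```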

Figure 11.20 (rhs) verifies this decision of keeping the categorical variables. For this latter graph we randomly permute Region across the entire portfolio, and we run the same group LASSO regularized fitting procedure again on this modified data. The vertical black line shows the average importance of the permuted Region variable for *η* = 0*.*0025. We see that only VehPower has a smaller importance measure, and all other variables dominate the permuted Region variable. This confirms our conclusions above.

**Fig. 11.20** Importance measures IM*<sup>j</sup>* of the group LASSO regularized LocalGLMnet for variable selection with different regularization parameters *η* ∈ {0, 0.0025, 0.005}: (lhs) original data, and (rhs) randomly permuted Region labels; the *x*-scale is the same in both plots

We conclude that the LocalGLMnet architecture with a group LASSO regularization is helpful for variable selection, and, more generally, the LocalGLMnet architecture is useful for model interpretation, i.e., for finding interactions and functional forms of the features entering the regression function. In examples that have categorical variables with many levels, the LocalGLMnet approach may not lead to a fully competitive regression model. In this case, the LocalGLMnet can be used for variable selection, and another network architecture should then be fitted to the selected variables. Alternatively, we can embed the categorical variables in a preparatory network step, and then work with these embeddings of the categorical variables (kept fixed within the LocalGLMnet).

## **11.6 Selected Applications**

## *11.6.1 Mixture Density Networks*

In Sect. 6.3 we have introduced mixture distributions and presented the EM algorithm for fitting them. The EM algorithm alternates two steps, an expectation step (E-step) and a maximization step (M-step). The E-step is motivated by (6.34): it determines the posterior distribution of the latent variable *Z*, given the observation *Y* and the current estimates of the model parameters *θ* and *p*. The M-step (6.35) determines the optimal model parameters *θ* and *p*, based on the observation *Y* and the posterior distribution of *Z*. Typically, we perform MLE in the M-step. However, for the EM algorithm to function it is not necessary to attain the maximum in the M-step; the monotonicity in (6.38) is sufficient. Thus, if at algorithmic time *t* − 1 we have a parameter estimate $(\widehat{\theta}^{(t-1)}, \widehat{p}^{(t-1)})$, it suffices that the next estimate $(\widehat{\theta}^{(t)}, \widehat{p}^{(t)})$ increases the log-likelihood, without necessarily being the MLE; this approach is called the generalized EM (GEM) algorithm. Exactly this point makes it feasible to use the EM algorithm also in cases where we model the parameters through networks that are fitted with gradient descent (ascent) algorithms. These methods go under the name of mixture density networks (MDNs).
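The GEM idea can be illustrated outside the book's EDF setting; the following Python sketch fits a synthetic two-component Gaussian mixture, replacing the full M-step for the means by a single (small) gradient step, which is enough to preserve the monotonicity (6.38):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data from a two-component Gaussian mixture (unit variances)
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

def log_lik(y, p, mu):
    """Incomplete log-likelihood of the two-component mixture."""
    dens = np.stack([p[k] * np.exp(-0.5 * (y - mu[k]) ** 2) / np.sqrt(2 * np.pi)
                     for k in range(2)])
    return float(np.sum(np.log(dens.sum(axis=0))))

p, mu, lr = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), 0.1
lls = [log_lik(y, p, mu)]
for _ in range(50):
    # E-step: posterior responsibilities of the latent Z
    # (normalizing constants cancel since both variances are equal)
    w = np.stack([p[k] * np.exp(-0.5 * (y - mu[k]) ** 2) for k in range(2)])
    w = w / w.sum(axis=0)
    # generalized M-step: a single gradient step on the means instead of
    # the full maximizer of the expected complete log-likelihood
    grad = np.array([np.sum(w[k] * (y - mu[k])) for k in range(2)])
    mu = mu + lr * grad / len(y)
    # the mixture probabilities admit a closed-form (exact) update
    p = w.mean(axis=1)
    lls.append(log_lik(y, p, mu))
```

Because each partial M-step still increases the expected complete log-likelihood, the incomplete log-likelihood trace `lls` is non-decreasing, exactly the GEM property used in the text.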

MDNs have been introduced by Bishop [35], who explores MDNs on Gaussian mixtures, using SGD and quasi-Newton methods for model fitting. MDNs have also started to gain popularity within the actuarial community; recent papers include Delong et al. [95], Kuo [230] and Al-Mudafer et al. [6], the latter two considering MDNs for claims reserving.

We recall the mixture density for a selected member of the EDF. The incomplete log-likelihood of the data *(Yi, xi, vi)*<sup>1</sup>≤*i*≤*<sup>n</sup>* is given by, see (6.24),

$$\begin{aligned} (\boldsymbol{\theta}, \boldsymbol{\varphi}, \boldsymbol{p}) \;\mapsto\; \ell_{\boldsymbol{Y}}(\boldsymbol{\theta}, \boldsymbol{\varphi}, \boldsymbol{p}) &= \sum_{i=1}^{n} \ell_{Y_i}\bigl(\boldsymbol{\theta}(\boldsymbol{x}_i), \boldsymbol{\varphi}(\boldsymbol{x}_i), \boldsymbol{p}(\boldsymbol{x}_i)\bigr) \\ &= \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} p_k(\boldsymbol{x}_i)\, f_k\!\left( Y_i;\, \theta_k(\boldsymbol{x}_i), \frac{v_i}{\varphi_k(\boldsymbol{x}_i)} \right) \right), \end{aligned}$$

for canonical parameter $\boldsymbol{\theta} = (\theta_1,\ldots,\theta_K)^\top \in \boldsymbol{\Theta} = \Theta_1 \times \cdots \times \Theta_K$, dispersion parameter $\boldsymbol{\varphi} = (\varphi_1,\ldots,\varphi_K)^\top \in \mathbb{R}_+^K$, mixture probability $\boldsymbol{p} \in \Delta_K$, where *K* denotes the number of mixture components. MDNs model these parameters with networks. Choose a FN network $\boldsymbol{z}^{(d:1)}: \mathbb{R}^{q+1} \to \{1\}\times\mathbb{R}^{q_d}$ of depth *d*, with input dimension *q* equal to the dimension of the features $\boldsymbol{x} \in \mathcal{X} \subseteq \{1\}\times\mathbb{R}^{q}$ and output dimension $q_d+1$. This gives us the learned representations $\boldsymbol{z}_i = \boldsymbol{z}^{(d:1)}(\boldsymbol{x}_i)$, which are used to model the parameters. For the mixture probability $\boldsymbol{p}$ we build a logistic categorical GLM based on $\boldsymbol{z}_i$. For the (canonical) link *h*, we set the linear predictor, see (5.72),

$$h(\boldsymbol{p}(\boldsymbol{z}_i)) = h\!\left(\boldsymbol{p}\!\left(\boldsymbol{z}^{(d:1)}(\boldsymbol{x}_i)\right)\right) = \left(\langle \boldsymbol{\beta}^{p}_{1}, \boldsymbol{z}_i\rangle, \ldots, \langle \boldsymbol{\beta}^{p}_{K}, \boldsymbol{z}_i\rangle\right)^{\top} \in \mathbb{R}^{K},\qquad(11.47)$$

with regression parameter $\boldsymbol{\beta}^{p} = ((\boldsymbol{\beta}^{p}_{1})^\top,\ldots,(\boldsymbol{\beta}^{p}_{K})^\top)^\top \in \mathbb{R}^{K(q_d+1)}$. For the canonical parameter $\boldsymbol{\theta}$, the mean parameter $\boldsymbol{\mu}$, respectively, and the dispersion parameter $\boldsymbol{\varphi}$ we proceed analogously. Choose strictly monotone and smooth link functions $g_\mu$ and $g_\varphi$, and consider the double GLMs, for 1 ≤ *k* ≤ *K*, on the learned representations $\boldsymbol{z}_i$

$$g_{\mu}(\mu_k(\boldsymbol{z}_i)) = \langle \boldsymbol{\beta}^{\mu}_{k}, \boldsymbol{z}_i \rangle \qquad \text{and} \qquad g_{\varphi}(\varphi_k(\boldsymbol{z}_i)) = \langle \boldsymbol{\beta}^{\varphi}_{k}, \boldsymbol{z}_i \rangle,\qquad(11.48)$$

with regression parameters $\boldsymbol{\beta}^{\mu} = ((\boldsymbol{\beta}^{\mu}_{1})^\top,\ldots,(\boldsymbol{\beta}^{\mu}_{K})^\top)^\top \in \mathbb{R}^{K(q_d+1)}$ for the mean parameters and $\boldsymbol{\beta}^{\varphi} = ((\boldsymbol{\beta}^{\varphi}_{1})^\top,\ldots,(\boldsymbol{\beta}^{\varphi}_{K})^\top)^\top \in \mathbb{R}^{K(q_d+1)}$ for the dispersion parameters. Altogether this gives us a network parameter of dimension, setting $q_0 = q$,

$$r = \sum\_{m=1}^{d} q\_m(q\_{m-1} + 1) + 3K(q\_d + 1).$$
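For numerical stability, the incomplete log-likelihood above is best evaluated with a log-sum-exp over the mixture components; a hedged Python sketch (the log mixture probabilities and component log-densities are assumed to be given as arrays):

```python
import numpy as np

def incomplete_log_lik(log_p, log_f):
    """Incomplete log-likelihood sum_i log(sum_k p_k(x_i) f_k(Y_i; ...)).

    log_p: (n, K) array of log mixture probabilities log p_k(x_i)
    log_f: (n, K) array of component log-densities log f_k(Y_i; ...)
    Uses the log-sum-exp trick to avoid underflow for small densities."""
    a = log_p + log_f                       # (n, K)
    m = a.max(axis=1, keepdims=True)        # per-row stabilizer
    return float(np.sum(m[:, 0] + np.log(np.exp(a - m).sum(axis=1))))

# tiny illustration with n=3 observations and K=2 equal-probability components
log_p = np.log(np.full((3, 2), 0.5))
log_f = np.log(np.array([[0.2, 0.4], [0.1, 0.1], [0.3, 0.5]]))
ll = incomplete_log_lik(log_p, log_f)
# equals log(0.3) + log(0.1) + log(0.4)
```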

#### *Remarks 11.14*


• The FN network $\boldsymbol{z}^{(d:1)}$ simultaneously serves the modeling of $\boldsymbol{p}$, $\boldsymbol{\mu}$ and $\boldsymbol{\varphi}$; it therefore has to be chosen sufficiently large, so that it can comply simultaneously with these different tasks. Alternatively, we could choose three separate (parallel) networks for $\boldsymbol{p}$, $\boldsymbol{\mu}$ and $\boldsymbol{\varphi}$, respectively. This second proposal does not (easily) allow for (non-trivial) interactions between the parameters, and it may also suffer from less robustness in fitting.


*Example 11.15 (Gamma Claim Size Modeling and MDNs)* We revisit Example 6.14 which models the claim sizes of the French MTPL data. For the modeling of these claim sizes we choose the mixture distribution (6.39) which has four gamma components $f_1,\ldots,f_4$ and one Lomax component $f_5$. In a first step we again model these five mixture components independently of the feature information $\boldsymbol{x}$, and the feature information only enters the mixture probabilities $\boldsymbol{p}(\boldsymbol{x}) \in \Delta_5$. This modeling approach has been motivated by Fig. 13.17, which suggests that the features mainly result in systematic effects on the mixture probabilities. We choose the same model and feature information as in Example 6.14. We only replace the logistic categorical GLM part (6.40) for modeling $\boldsymbol{p}(\boldsymbol{x})$ by a depth *d* = 2 FN network with $(q_1, q_2) = (20, 10)$ neurons. Area, VehAge, DrivAge and BonusMalus are modeled as continuous variables, and for the categorical variables VehBrand and Region we choose two-dimensional embedding layers.
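For illustration, such a four-gamma plus Lomax mixture can be simulated directly; this Python sketch uses purely illustrative parameter values, drawing the Lomax by inverse transform from its survival function $(1+y/M)^{-\beta_5}$:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_mixture(p, means, shapes, lomax_scale, lomax_tail, size):
    """Draw from a mixture of four gamma components and one Lomax
    component; all parameter values passed below are illustrative."""
    z = rng.choice(5, size=size, p=p)       # latent component labels Z
    y = np.empty(size)
    for k in range(4):
        idx = z == k
        # gamma with mean means[k] and shape shapes[k]
        y[idx] = rng.gamma(shapes[k], means[k] / shapes[k], idx.sum())
    idx = z == 4
    # Lomax via inverse transform: y = M * (U^(-1/beta5) - 1)
    u = rng.uniform(size=idx.sum())
    y[idx] = lomax_scale * (u ** (-1.0 / lomax_tail) - 1.0)
    return y

y = sample_mixture([0.3, 0.25, 0.2, 0.15, 0.1],
                   [75.0, 600.0, 1175.0, 2000.0], [20.0, 40.0, 30.0, 1.0],
                   2000.0, 1.5, 100_000)
```

With a heavy Lomax tail ($\beta_5$ close to 1) the sample mean fluctuates strongly, which is the reason claim size models of this kind separate body and tail components.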

**Listing 11.12** R code of the MDN for modeling the mixture probability *p(x)*

```
1 Design = layer_input(shape = c(4), dtype = 'float32')
2 VehBrand = layer_input(shape = c(1), dtype = 'int32')
3 Region = layer_input(shape = c(1), dtype = 'int32')
4 Bias = layer_input(shape = c(1), dtype = 'float32')
5 #
6 BrandEmb = VehBrand %>%
7 layer_embedding(input_dim = 11, output_dim = 2, input_length = 1) %>%
8 layer_flatten()
9 RegionEmb = Region %>%
10 layer_embedding(input_dim = 22, output_dim = 2, input_length = 1) %>%
11 layer_flatten()
12 #
13 pp = list(Design, BrandEmb, RegionEmb) %>% layer_concatenate() %>%
14 layer_dense(units=20, activation='tanh') %>%
15 layer_dense(units=10, activation='tanh') %>%
16 layer_dense(units=5, activation='softmax')
17 #
18 mu = Bias %>% layer_dense(units=4, activation='exponential',
19 use_bias=FALSE)
20 #
21 tail = Bias %>% layer_dense(units=1, activation='sigmoid',
22 use_bias=FALSE)
23 #
24 shape = Bias %>% layer_dense(units=4, activation='exponential',
25 use_bias=FALSE)
26 #
27 Response = list(pp, mu, tail, shape) %>% layer_concatenate()
28 #
29 keras_model(inputs = c(Design, VehBrand, Region, Bias), outputs = c(Response))
```
Listing 11.12 shows the chosen network. Lines 13–16 model the mixture probability *p(x)*. We also integrate the modeling of the (homogeneous) parameters of the mixture densities $f_1,\ldots,f_5$. Lines 18 and 24 of Listing 11.12 consider the mean and shape parameter of the gamma components, and line 21 the tail parameter $1/\beta_5$ of the Lomax component. Note that we use the sigmoid activation for this Lomax parameter. This implies $1/\beta_5 \in (0, 1)$ and, thus, $\beta_5 > 1$, which enforces a finite-mean model. The exponential activations on lines 18 and 24 ensure positivity of these parameters. The input Bias to these variables is simply the constant 1, which corresponds to the homogeneous case not differentiating w.r.t. the features.
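The effect of these output activations can be checked directly; a small Python sketch (parameter values illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lomax_mean(scale_M, tail_inv):
    """Mean of a Lomax with scale M and tail index beta5 = 1/tail_inv;
    the mean M / (beta5 - 1) is finite exactly when beta5 > 1,
    i.e. when tail_inv lies in (0, 1)."""
    beta5 = 1.0 / tail_inv
    return scale_M / (beta5 - 1.0)

# the sigmoid output guarantees tail_inv in (0, 1), hence beta5 > 1,
# so the Lomax mean below is always finite and positive
for z in (-5.0, 0.0, 5.0):
    t = sigmoid(z)
    assert 0.0 < t < 1.0
    assert lomax_mean(2000.0, t) > 0.0

# similarly, an exponential activation exp(z) > 0 for all z enforces
# positivity of the gamma mean and shape parameters
```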

Observe that in most of the networks so far, the output of the network was equal to an expected response of a random variable that we try to predict. In this MDN we output the parameters of a distribution function, see line 27 of Listing 11.12. In our case this output has dimension 14, which then enters the score in Listing 11.13. In a first attempt we fit this MDN brute-force by just implementing the incomplete log-likelihood received from (6.39). Since the gamma function $\Gamma(\cdot)$ is not easily available in keras [77], we replace the gamma density by its saddlepoint approximation, see Sect. 5.5.2. Listing 11.13 shows the negative log-likelihood of the mixture density that is used to perform the brute-force SGD fitting.


Lines 2–9 give the saddlepoint approximations to the four gamma components, and line 10 the Lomax component with scale parameter *M*. Note that this brute-force approach is based only on the incomplete observation *Y* encoded in true[,1], see Listing 11.13.
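The saddlepoint approximation used here amounts to replacing $\Gamma(\alpha)$ by its Stirling approximation; the following Python sketch compares the exact gamma log-density (mean parametrization) with the saddlepoint form $(2\pi \varphi y^2)^{-1/2} \exp(-d(y;\mu)/(2\varphi))$, where $d$ is the gamma unit deviance and $\varphi = 1/\alpha$:

```python
import math

def gamma_logpdf(y, alpha, mu):
    """Exact gamma log-density with mean mu and shape alpha."""
    beta = alpha / mu
    return (alpha * math.log(beta) + (alpha - 1.0) * math.log(y)
            - beta * y - math.lgamma(alpha))

def gamma_logpdf_saddlepoint(y, alpha, mu):
    """Saddlepoint approximation: -0.5*log(2*pi*phi*y^2) - d(y;mu)/(2*phi)
    with dispersion phi = 1/alpha and gamma unit deviance
    d(y; mu) = 2 * (y/mu - 1 - log(y/mu))."""
    phi = 1.0 / alpha
    d = 2.0 * (y / mu - 1.0 - math.log(y / mu))
    return -0.5 * math.log(2.0 * math.pi * phi * y * y) - d / (2.0 * phi)

# the two agree up to the Stirling error, which is of order 1/(12*alpha)
err = abs(gamma_logpdf(2.5, 20.0, 2.0) - gamma_logpdf_saddlepoint(2.5, 20.0, 2.0))
```

The approximation only involves `exp`, `log` and `sqrt`, which is precisely why it is convenient inside a keras loss function.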

We fit this logistic categorical FN network of Listing 11.12 under the score function of Listing 11.13 using the nadam version of SGD. Moreover, we use a stratified training-validation split; otherwise we did not obtain a competitive model. The results are presented in Table 11.12 on line 'logistic FN network: brute-force fitting'. We observe a slightly worse in-sample performance than in the logistic GLM. This does not justify the use of the more complex network architecture. In other words, the feature pre-processing seems to have been done suitably in Example 6.14.

In a next step, we fit this MDN with the (generalized) EM algorithm. The E-step is exactly the same as in Example 6.14. For the M-step, having knowledge of the (latent mixture component) variables $Z_i$, 1 ≤ *i* ≤ *n*, implies that the mixture probability estimation and the mixture density estimation completely decouple. As a consequence, the parameters of the density components $f_1,\ldots,f_5$ can directly be estimated by univariate MLEs, as in Example 6.14. The only part that needs further explanation is the estimation of the logistic categorical FN network for $\boldsymbol{p}(\boldsymbol{x})$. In each loop of the EM iteration we would like to find the optimal network parameter for $\boldsymbol{p}(\boldsymbol{x})$, and at the same time we have to ensure the monotonicity (6.38). Following the 'EM forward network' approach of Delong et al. [95], this is most easily achieved by initializing the FN network in loop *t* of the algorithm with the optimal network parameter of the previous loop *t* − 1. Thus, the starting parameter of SGD reflects the optimal parameter from the previous step, and since SGD generally decreases losses, the monotonicity (6.38) holds. The latter statement is not strictly true: SGD introduces additional randomness through the building of (mini-)batches; therefore, monotonicity should be traced explicitly (which also ensures that the early stopping rule is chosen suitably). We have implemented such an EM-SGD algorithm; essentially, we just have to drop lines 17–28 of Listing 11.12, and lines 13–16 provide the entire response. As loss function we choose the categorical (multi-class) cross-entropy loss, see (4.19). The results in Table 11.12 on line 'logistic FN network: EM fitting' indicate a superior fitting behavior compared to the brute-force fitting. Nevertheless, this network approach still does not outperform the GLM approach, saying that we should stay with the simpler GLM.

**Table 11.12** Mixture models for French MTPL claim size modeling; we set *M* = 2 000

In a final step, we also model the mean parameters $\mu_k(\boldsymbol{x})$, 1 ≤ *k* ≤ 4, of the gamma components feature dependent, to see whether we can gain predictive power from this additional flexibility or whether our initial model choice is sufficient. For robustness reasons, we model neither the shape parameters $\beta_k$, 1 ≤ *k* ≤ 4, of the gamma components nor the tail parameter $\beta_5$ of the Lomax component feature dependent. The implementation only requires small changes to Listing 11.12, see Listing 11.14.

A brute-force fitting of the MDN architecture of Listing 11.14 can directly be based on the score function (negative incomplete log-likelihood) of Listing 11.13. In the case of the EM algorithm we need to change the score function to the complete log-likelihood, accounting for the latent variables $Z_i \in \{1,\ldots,5\}$. This is done in Listing 11.15, where $Z_i$ is encoded in the variables true[,2] to true[,6].

We fit this MDN using the two different fitting approaches, and the results are given on the last two lines of Table 11.12. Again the performance of the EM fitting is slightly better than the brute-force fitting, and the bigger log-likelihoods indicate that we can gain predictive power by also modeling the means of the gamma components feature dependent.

Figure 11.21 compares the QQ plot of the resulting MDN with EM fitting to the one received from the logistic categorical GLM of Example 6.14. The two graphs are very similar, and we conclude that in this particular example the simpler proposal of Example 6.14 is sufficient.

In a next step, we try to understand which feature components influence the mixture probabilities $\boldsymbol{p}(\boldsymbol{x}) = (p_1(\boldsymbol{x}), \ldots, p_K(\boldsymbol{x}))^\top$ the most. Similarly to Examples 6.14 and 11.15, we therefore use an MDN where only the mixture probability $\boldsymbol{p}(\boldsymbol{x})$ is fitted with a network and the mixture components $f_1,\ldots,f_K$ are assumed to be homogeneous.

*Example 11.16 (MDN with LocalGLMnet)* We revisit Example 11.15. We choose the mixture distribution (6.39) which has four gamma components $f_1,\ldots,f_4$ and a Lomax component $f_5$. We select their parameters independently of the features. The feature information $\boldsymbol{x}$ should only enter the mixture probability $\boldsymbol{p}(\boldsymbol{x}) \in \Delta_5$, similarly to the first part of Example 11.15. We replace the logistic FN network of

**Listing 11.14** R code of the MDN for modeling the mixture probability *p(x)* and the gamma means *μk(x)*

```
1 Design = layer_input(shape = c(4), dtype = 'float32')
2 VehBrand = layer_input(shape = c(1), dtype = 'int32')
3 Region = layer_input(shape = c(1), dtype = 'int32')
4 Bias = layer_input(shape = c(1), dtype = 'float32')
5 #
6 BrandEmb = VehBrand %>%
7 layer_embedding(input_dim = 11, output_dim = 2, input_length = 1) %>%
8 layer_flatten()
9 RegionEmb = Region %>%
10 layer_embedding(input_dim = 22, output_dim = 2, input_length = 1) %>%
11 layer_flatten()
12 #
13 Network = list(Design, BrandEmb, RegionEmb) %>% layer_concatenate() %>%
14 layer_dense(units=20, activation='tanh') %>%
15 layer_dense(units=15, activation='tanh') %>%
16 layer_dense(units=10, activation='tanh')
17 #
18 pp = Network %>% layer_dense(units=5, activation='softmax')
19 #
20 mu = Network %>% layer_dense(units=4, activation='exponential',
21 use_bias=FALSE)
22 #
23 tail = Bias %>% layer_dense(units=1, activation='sigmoid',
24 use_bias=FALSE)
25 #
26 shape = Bias %>% layer_dense(units=4, activation='exponential',
27 use_bias=FALSE)
28 #
29 Response = list(pp, mu, tail, shape) %>% layer_concatenate()
30 #
31 keras_model(inputs = c(Design, VehBrand, Region, Bias), outputs = c(Response))
```
**Listing 11.15** Mixture density negative complete log-likelihood


Example 11.15 for modeling *p(x)* by a LocalGLMnet such that we can analyze the importance of the variables, see Sect. 11.5.

For the feature information we choose the continuous variables Area, VehPower, VehAge, DrivAge and BonusMalus, the binary variable VehGas and the categorical variables VehBrand and Region; thus, compared to Example 11.15 we add VehPower and VehGas. These latter two variables have not been included previously because they did not seem to be important

**Fig. 11.21** QQ plots of mixture models: (lhs) logistic categorical GLM for the mixture probabilities and (rhs) MDN with EM fitting

w.r.t. Fig. 13.17. The continuous and binary variables are centered and normalized to unit variance. For the categorical variables we use two-dimensional embedding layers; afterwards they are concatenated with the continuous variables, with a subsequent normalization layer (to ensure that all components live on the same scale). This provides us with a 10-dimensional feature vector, which is complemented with an i.i.d. standard Gaussian component, called Random, to perform an empirical Wald-type test. We call this pre-processed feature (after embedding and normalization of the categorical variables) $\boldsymbol{x} \in \mathbb{R}^{q_0}$ with $q_0 = 11$.

We design a LocalGLMnet that acts on this feature $\boldsymbol{x} \in \mathbb{R}^{q_0}$ for modeling a categorical multi-class output with *K* = 5 levels. Therefore, we choose the regression attentions

$$\boldsymbol{z}^{(d:1)}: \mathbb{R}^{q_0} \to \mathbb{R}^{q_0 \times K}, \qquad \boldsymbol{x} \mapsto \boldsymbol{\beta}(\boldsymbol{x}) = \left(\boldsymbol{\beta}_1(\boldsymbol{x}), \ldots, \boldsymbol{\beta}_K(\boldsymbol{x})\right) = \boldsymbol{z}^{(d:1)}(\boldsymbol{x}),$$

where $\boldsymbol{z}^{(d:1)}$ is a network of depth *d* having a matrix-valued output of dimension $q_0 \times K$. For the (canonical) link *h*, this gives us the predictor, see (5.72),

$$h(\boldsymbol{p}(\boldsymbol{x})) = \left(\beta_{1,0} + \langle \boldsymbol{\beta}_1(\boldsymbol{x}), \boldsymbol{x} \rangle, \ldots, \beta_{K,0} + \langle \boldsymbol{\beta}_K(\boldsymbol{x}), \boldsymbol{x} \rangle\right)^{\top} \in \mathbb{R}^{K},\qquad(11.49)$$

with intercepts $\beta_{k,0} \in \mathbb{R}$, and where $\boldsymbol{\beta}_k(\boldsymbol{x}) \in \mathbb{R}^{q_0}$ is the *k*-th column of the regression attention $\boldsymbol{\beta}(\boldsymbol{x}) = \boldsymbol{z}^{(d:1)}(\boldsymbol{x}) \in \mathbb{R}^{q_0 \times K}$. We also refer to the second item of Remarks 11.14 concerning a possible dimension reduction in (11.49); in fact, we apply the softmax activation function to the right-hand side of (11.49), neglecting the identifiability issue. Moreover, as in the introduction of the LocalGLMnet, we separate the intercept components from the remaining features in (11.49).

We fit this LocalGLMnet-MDN with the EM version presented in Example 11.15. We apply early stopping based on the same stratified training-validation split as in the aforementioned example, and this provides us with a log-likelihood of −198'290, thus slightly bigger than the corresponding numbers in Table 11.12. More interestingly, our goal is to understand the regression attentions given by $\boldsymbol{\beta}(\boldsymbol{x}_i) = (\boldsymbol{\beta}_1(\boldsymbol{x}_i), \ldots, \boldsymbol{\beta}_5(\boldsymbol{x}_i)) \in \mathbb{R}^{11\times 5}$ over all claims 1 ≤ *i* ≤ *n*. Figure 11.22 shows the resulting boxplots, where each of the five graphs corresponds to one mixture component 1 ≤ *k* ≤ 5, and the different colors illustrate the 11 feature components providing the attention weights $\beta_{k,j}(\boldsymbol{x}_i)$, 1 ≤ *j* ≤ 11. The red boxplots show the purely random component Random for 1 ≤ *k* ≤ 5, which provides the acceptance region of an empirical Wald test for the null hypothesis that the corresponding term should be dropped. This is highlighted by the orange shaded area (at a significance level of 0.1%). Thus, whenever a boxplot lies within this orange shaded area we may consider dropping this term; e.g., for *k* = 2 (top-right), this is the case for Area, VehPower and Region2 (being the second component of the two-dimensional region embedding). Note that this interpretation needs some care because we do not have identifiability in the class probabilities.
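The empirical Wald-type test can be mimicked on synthetic attentions; in this Python sketch (all attention values and feature names are illustrative) the acceptance region is taken from the extreme quantiles of the purely random component:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
# synthetic regression attentions: Random carries no signal,
# VehPower very little, BonusMalus a lot (names are illustrative)
attn = {
    "Random":     rng.normal(0.00, 0.05, n),
    "VehPower":   rng.normal(0.01, 0.05, n),
    "BonusMalus": rng.normal(0.40, 0.10, n),
}

# acceptance region at the 0.1% significance level, read off from the
# purely random component
lo, hi = np.quantile(attn["Random"], [0.0005, 0.9995])

def may_drop(values, lo, hi):
    """Drop candidate if the bulk of the attentions (here: the
    interquartile box of the boxplot) lies inside the acceptance region."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    return bool(lo <= q1 and q3 <= hi)

drop_vehpower = may_drop(attn["VehPower"], lo, hi)      # inside the band
drop_bonusmalus = may_drop(attn["BonusMalus"], lo, hi)  # clearly outside
```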

The first observation is that, indeed, VehPower is mostly in the orange confidence area and, thus, may be dropped. This does not apply to the other feature components, and, thus, we should keep them in the model. The three gamma mixture components $f_1$, $f_2$ and $f_3$ correspond to the three modes at 75, 600 and 1'175 in Fig. 13.17. Component $f_4$ is a gamma component covering the whole range of claims, and $f_5$ is the Lomax component modeling the regular variation in the tail. Interestingly, DrivAge and BonusMalus seem very important for mixture components *k* = 1, *k* = 3 and *k* = 4 (with different signs); this is supported by Fig. 13.17. The Lomax component seems mostly impacted by DrivAge, VehBrand and Region. Only mixture component *k* = 2 is more difficult to interpret. This component seems influenced by most of the feature components; in particular, the combination of VehAge, VehGas and VehBrand seems important. This could mean that mixture component *k* = 2 belongs to a certain type of vehicle.

In a next step we could study interactions and their impact on the mixture components, and LASSO regularization would provide us with another method of variable selection, see Sect. 11.5.4. We refrain from doing so and close the example.

## *11.6.2 Estimation of Conditional Expectations*

FN networks have also found their way into risk management problems. We briefly introduce a valuation problem and then describe a way of solving it. Assume we have a liability cash flow $\boldsymbol{Y}_{1:T} = (Y_1,\ldots,Y_T)^\top$ with (random) payments $Y_t$ at time points $t = 1,\ldots,T$. We assume that this liability cash flow $\boldsymbol{Y}_{1:T}$ is adapted to a filtration $(\mathcal{A}_t)_{1\le t\le T}$ on the underlying probability space $(\Omega, \mathcal{A}, \mathbb{P})$. Moreover, we assume to have a pricing kernel (state price deflator) $\boldsymbol{\psi}_{1:T} = (\psi_1,\ldots,\psi_T)^\top$ on that probability space, which is an $(\mathcal{A}_t)_{1\le t\le T}$-adapted


**Fig. 11.22** Boxplots of regression attentions $\boldsymbol{\beta}(\boldsymbol{x}_i) = (\boldsymbol{\beta}_1(\boldsymbol{x}_i), \ldots, \boldsymbol{\beta}_5(\boldsymbol{x}_i)) \in \mathbb{R}^{11\times 5}$ over all claims 1 ≤ *i* ≤ *n* for the different mixture components $f_1,\ldots,f_5$

random vector with strictly positive components $\psi_t > 0$, a.s., for all 1 ≤ *t* ≤ *T*. A no-arbitrage value of the outstanding liability cash flow at time 1 ≤ *τ* < *T* can be defined by (we assume existence of all second moments)

$$\mathcal{R}_{\tau} = \sum_{s=\tau+1}^{T} \frac{1}{\psi_{\tau}}\, \mathbb{E}\left[\left.\psi_{s} Y_{s}\,\right|\,\mathcal{A}_{\tau}\right].\qquad(11.50)$$

For the mathematical background on no-arbitrage pricing using state price deflators we refer to Wüthrich–Merz [393]. The $\mathcal{A}_\tau$-measurable quantity $\mathcal{R}_\tau$ is called the reserves of the outstanding liabilities at time *τ*. From a risk management and solvency point of view we would like to understand the volatility of the reserves $\mathcal{R}_\tau$ seen from time 0, i.e., we try to model the random variable $\mathcal{R}_\tau$ seen from time 0 (based on the trivial *σ*-algebra $\mathcal{A}_0 = \{\emptyset, \Omega\}$). In applied problems, the difficulty often is that the conditional expectations under the summation in (11.50) cannot be computed in closed form. Therefore, the law of $\mathcal{R}_\tau$ cannot be determined explicitly.

We provide a numerical solution for the calculation of the conditional expectations in (11.50). Assume that the information set $\mathcal{A}_\tau$ can be described by a random vector $\boldsymbol{X}_\tau$, i.e., $\mathcal{A}_\tau = \sigma(\boldsymbol{X}_\tau)$. In that case we rewrite (11.50) as follows

$$\mathcal{R}_{\tau} = \sum_{s=\tau+1}^{T} \frac{1}{\psi_{\tau}}\, \mathbb{E}\left[\left.\psi_{s} Y_{s}\,\right|\,\boldsymbol{X}_{\tau}\right].\qquad(11.51)$$

The latter indicates that we can determine the conditional expectations in (11.51) as regression functions in the features $\boldsymbol{X}_\tau$, and we try to understand, for *s* > *τ*,

$$\boldsymbol{x}_{\tau} \mapsto \mathbb{E}\left[\left.\frac{\psi_{s}}{\psi_{\tau}}\, Y_{s}\,\right|\,\boldsymbol{X}_{\tau} = \boldsymbol{x}_{\tau}\right].\qquad(11.52)$$

The random variable $\mathcal{R}_\tau$ can then be determined empirically by simulation. This requires two steps: (1) we have to be able to simulate $\psi_s Y_s/\psi_\tau$, conditionally given $\boldsymbol{X}_\tau = \boldsymbol{x}_\tau$; this allows us to estimate the conditional expectation (11.52) with a regression function. (2) We need to be able to simulate $\boldsymbol{X}_\tau$; this provides us with the empirical occurrence probabilities of specific choices $\boldsymbol{X}_\tau = \boldsymbol{x}_\tau$ in (11.52), which then gives an empirical version of $\mathcal{R}_\tau$.

In theory, this problem can be approached by nested simulations, a two-stage procedure that first performs step (2) and then carries out step (1) with Monte Carlo simulations for every realization of step (2), see, e.g., Lee [242] and Glynn–Lee [161]. The disadvantage of this two-stage nested simulation procedure is that it is computationally demanding. Building upon the work on the valuation of American options by Carriere [65], Tsitsiklis–Van Roy [356] and Longstaff–Schwartz [257], the papers of Broadie et al. [55] and Ha–Bauer [177] propose to regress future cash flows on finitely many basis functions depending on the state variable $\boldsymbol{X}_\tau$. More recently, machine learning tools such as FN networks have been proposed to determine these basis and regression functions, see, e.g., Cheridito et al. [74] or Krah et al. [224].
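The regression idea can be illustrated in a toy setting; this Python sketch (with an assumed quadratic true regression function) replaces the inner loop of a nested simulation by a single least-squares regression on a polynomial basis, the basis playing the role of the network:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
# simulate state X and one (noisy) cash-flow sample Y per state;
# the (assumed) true conditional expectation is mu(x) = x**2 + 1
X = rng.normal(0.0, 1.0, n)
Y = X**2 + 1.0 + rng.normal(0.0, 0.5, n)

# least-squares regression on a small polynomial basis (1, x, x^2)
# replaces the inner Monte Carlo loop of a nested simulation
A = np.stack([np.ones(n), X, X**2], axis=1)
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)

def mu_hat(x):
    return coef[0] + coef[1] * x + coef[2] * x**2

# a single outer simulation of X then yields the empirical law of
# the functional mu_hat(X), i.e. of the reserves
```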

In the following, we assume that all random variables considered are square-integrable; thus, we can work in the Hilbert space $L^2(\Omega, \mathcal{A}, \mathbb{P})$ with scalar product $\langle X, Z\rangle = \mathbb{E}[XZ]$ for $X, Z \in L^2(\Omega, \mathcal{A}, \mathbb{P})$. Moreover, for simplicity, we drop the time indices and we also drop the stochastic discounting in (11.52) by assuming $\psi_s/\psi_\tau \equiv 1$. These simplifications are not essential technically and simplify our outline. The conditional expectation $\mu(\boldsymbol{X}) = \mathbb{E}[Y|\boldsymbol{X}]$ can then be found by the orthogonal projection of $Y$ onto the subspace generated by $\sigma(\boldsymbol{X})$ in the Hilbert space $L^2(\Omega, \mathcal{A}, \mathbb{P})$. That is, the conditional expectation is the measurable function $\mu: \mathbb{R}^q \to \mathbb{R}$, $\boldsymbol{X} \mapsto \mu(\boldsymbol{X})$, that minimizes the mean squared error

$$\mathbb{E}\left[\left(Y-\mu(X)\right)^{2}\right] \stackrel{!}{=} \text{min},\tag{11.53}$$

among all measurable functions on *X*. In Example 3.7, we have seen that *μ(*·*)* is the minimizer of this problem if and only if

$$\mu(\boldsymbol{x}) = \underset{m \in \mathbb{R}}{\arg\min} \int_{\mathbb{R}} (y - m)^2 \, dF_{Y|\boldsymbol{x}}(y),\qquad(11.54)$$

for $p_{\boldsymbol{x}}$-a.e. $\boldsymbol{x} \in \mathbb{R}^q$, where $p_{\boldsymbol{x}}$ is the distribution of $\boldsymbol{X}$, and where $F_{Y|\boldsymbol{x}}$ is the conditional distribution of $Y$, given feature $\boldsymbol{X} = \boldsymbol{x}$; we also refer to (3.6).

Under the assumption that we can simulate observations $(Y, \boldsymbol{X})$ under $\mathbb{P}$, we can solve (11.53)–(11.54) approximately by restricting to a sufficiently rich family of regression functions. Choose a FN network $\boldsymbol{z}^{(d:1)}: \mathbb{R}^q \to \mathbb{R}^{q_d}$ of depth *d* and the identity link $g(x) = x$. An optimal network parameter $\vartheta$ is found by minimizing

$$\widehat{\vartheta} = \underset{\vartheta \in \mathbb{R}^{r}}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \left\langle \boldsymbol{\beta}, \boldsymbol{z}^{(d:1)}(\boldsymbol{X}_i) \right\rangle \right)^2,\qquad(11.55)$$

where $(Y_i, \boldsymbol{X}_i)$, 1 ≤ *i* ≤ *n*, are i.i.d. copies of $(Y, \boldsymbol{X})$. This provides us with the fitted FN network $\widehat{\boldsymbol{z}}^{(d:1)}(\cdot)$ and the fitted output parameter $\widehat{\boldsymbol{\beta}}$, which give an approximation to the conditional expectation, the solution of (11.54),

$$\boldsymbol{x} \mapsto \widehat{\mu}(\boldsymbol{x}) = \left\langle \widehat{\boldsymbol{\beta}}, \widehat{\boldsymbol{z}}^{(d:1)}(\boldsymbol{x}) \right\rangle \approx \mu(\boldsymbol{x}) = \mathbb{E}\left[Y|\boldsymbol{X}=\boldsymbol{x}\right].\qquad(11.56)$$

This then allows us to approximate the random variable in (11.51) empirically by simulating features $\boldsymbol{X}$ and inserting them into the left-hand side of (11.56).

#### *Remarks 11.17*

• There are different types of errors involved. First, there is an irreducible approximation error if the chosen family of FN networks is not sufficiently rich to approximate the conditional expectation well. For example, if we choose the hyperbolic tangent activation function, then, naturally, $\boldsymbol{z}^{(d:1)}(\cdot)$ is uniformly bounded for a fixed network parameter $\vartheta$. This does not necessarily apply to the conditional expectation $\mathbb{E}[Y|\boldsymbol{X}=\cdot\,]$ and, thus, the approximation in the tail may be poor. Second, we consider an approximation based on a finite sample in (11.55). However, this error can be made arbitrarily small by letting *n* → ∞; in-sample over-fitting should not be an issue as we may generate samples of arbitrarily large size. Third, having the approximation (11.56), we still need to simulate i.i.d. samples $\boldsymbol{X}_k$, *k* ≥ 1, having the same distribution as $\boldsymbol{X}$ to empirically approximate the distribution of the random variable $\mathcal{R}_\tau$ in (11.51). Also in this step we benefit from the fact that we can simulate infinitely many samples to mitigate this approximation error.

• To fit the network parameter $\vartheta$ in (11.55) we use i.i.d. copies $(Y_i, X_i)$, $1 \le i \le n$, that have the same distribution as $(Y, X)$ under $\mathbb{P}$. However, to obtain a good approximation to the regression function $\mathbf{x} \mapsto \mu(\mathbf{x})$ we only need to simulate $Y_i\,|\,\{X_i = \mathbf{x}_i\}$ from $F_{Y|\mathbf{x}_i}(\cdot) = \mathbb{P}[\,\cdot\,|X_i = \mathbf{x}_i]$, whereas $X_i$ can be simulated from an arbitrary distribution equivalent to $p_{\mathbf{x}}$, and we still get the right conditional expectation in (11.54). This is worth mentioning because if we need a higher precision in some part of the feature space of $X$, we can apply a sort of importance sampling by choosing a distribution for $X$ that generates more samples in the corresponding part of the feature space compared to the original (true) distribution $p_{\mathbf{x}}$ of $X$; this proposal has been emphasized in Cheridito et al. [74].
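The second remark can be illustrated directly: drawing the features from a shifted distribution leaves the fitted regression function correct, while placing more training samples where precision is needed. A toy numpy sketch of our own, with $Y \,|\, X = x \sim \mathcal{N}(x^2, 1)$ and features drawn from $\mathcal{N}(2, 1)$ instead of a "true" $\mathcal{N}(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Features drawn from the shifted (equivalent) distribution N(2, 1) instead
# of a "true" N(0, 1); the conditional law Y | X = x ~ N(x^2, 1) is unchanged.
n = 200_000
X = rng.normal(2.0, 1.0, n)
Y = X**2 + rng.normal(0.0, 1.0, n)

# Least-squares fit as in (11.55), cubic polynomial basis as a stand-in
# for the FN network.
Z = np.vander(X, 4)
beta = np.linalg.lstsq(Z, Y, rcond=None)[0]

def mu_hat(x):
    return np.vander(np.atleast_1d(x), 4) @ beta

# The regression function mu(x) = x^2 is still recovered correctly, now with
# many more training samples in the right tail of the feature space.
```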

We study the example presented in Ha–Bauer [177] and Cheridito et al. [74]. This example considers a variable annuity (VA) with a guaranteed minimum income benefit (GMIB), and we revisit the network approach of Cheridito et al. [74].

*Example 11.18 (Approximation of Conditional Expectations)* We consider the VA example with a GMIB introduced and studied in Ha–Bauer [177]. This example involves a 3-dimensional stochastic process, for *t* ≥ 0,

$$X_t = (q_t, r_t, m_{x+t}),$$

with *qt* being the log-value of the VA account at time *t*, *rt* is the short rate at time *t*, and *mx*+*<sup>t</sup>* is the force of mortality at time *t* of a person aged *x* at time 0. The payoff at fixed maturity date *T >* 1 of this insurance contract is given by

$$S = S(X_T) = \max\left\{ e^{q_T},\; b\, a_{x+T}(r_T, m_{x+T}) \right\},$$

where $e^{q_T}$ is the VA account value at time $T$, and $b\, a_{x+T}(r_T, m_{x+T})$ is the GMIB at time $T$ consisting of a face value $b > 0$ and with $a_{x+T}(r_T, m_{x+T})$ being the value of an immediate annuity at time $T$ of a person aged $x + T$. Our goal is to model the conditional expectation

$$\mu(X_\tau) = D(\tau, T; X_\tau)\, \mathbb{E}\left[ S(X_T) \,|\, X_\tau \right] \tag{11.57}$$

$$= D(\tau, T; X_\tau)\, \mathbb{E}\left[ \max\left\{ e^{q_T},\; b\, a_{x+T}(r_T, m_{x+T}) \right\} \Big|\, X_\tau \right],$$

for a fixed valuation time point $0 < \tau < T$, and where $D(\tau, T) = D(\tau, T; X_\tau)$ is a $\sigma(X_\tau)$-measurable discount factor. This requires the explicit specification of the GMIB term as a function of $(r_T, m_{x+T})$, the modeling of the stochastic process $(X_t)_{0 \le t \le T}$, and the specification of the discount factor $D(\tau, T; X_\tau)$. In financial and actuarial valuation the regression function $\mu(\cdot)$ in (11.57) should reflect a no-arbitrage price. Therefore, $\mathbb{P}$ in (11.57) should be an equivalent martingale measure w.r.t. the selected numéraire. In our case, we choose a force of mortality $(m_{x+t})_t$-adjusted zero-coupon bond price as numéraire. This implies that $\mathbb{P}$ is a mortality-adjusted forward measure; for details and its explicit derivation we refer to Sect. 5.1 of Ha–Bauer [177]. In particular, Ha–Bauer [177] introduce a three-dimensional Brownian motion based model for $(X_t)_t$ from which they deduce all relevant terms explicitly. We skip these calculations here because, once the GMIB term and the discount factor are determined, everything boils down to knowing the distribution of the random vector $(X_\tau, X_T)$ under the corresponding probability measure $\mathbb{P}$. We choose initial age $x = 55$, maturity $T = 15$ and (solvency) time horizon $\tau = 1$. Under the model and parametrization of Ha–Bauer [177] we obtain the multivariate Gaussian distribution under $\mathbb{P}$ given by

$$(X_\tau, X_T)^\top = (q_\tau, r_\tau, m_{x+\tau}, q_T, r_T, m_{x+T})^\top \tag{11.58}$$

$$\sim \mathcal{N}\left( \begin{pmatrix} 4.64 \\ 0.02 \\ 0.01 \\ 4.71 \\ 0.02 \\ 0.03 \end{pmatrix},\; \begin{pmatrix} 3.2\cdot 10^{-2} & -4.8\cdot 10^{-4} & 1.3\cdot 10^{-5} & 3.1\cdot 10^{-2} & -1.4\cdot 10^{-5} & 3.6\cdot 10^{-5} \\ -4.8\cdot 10^{-4} & 7.9\cdot 10^{-5} & -4.4\cdot 10^{-7} & -1.7\cdot 10^{-4} & 2.4\cdot 10^{-6} & -1.2\cdot 10^{-6} \\ 1.3\cdot 10^{-5} & -4.4\cdot 10^{-7} & 1.5\cdot 10^{-6} & 1.2\cdot 10^{-5} & -1.3\cdot 10^{-8} & 4.1\cdot 10^{-6} \\ 3.1\cdot 10^{-2} & -1.7\cdot 10^{-4} & 1.2\cdot 10^{-5} & 4.5\cdot 10^{-1} & -1.3\cdot 10^{-3} & 3.0\cdot 10^{-4} \\ -1.4\cdot 10^{-5} & 2.4\cdot 10^{-6} & -1.3\cdot 10^{-8} & -1.3\cdot 10^{-3} & 2.0\cdot 10^{-4} & -2.5\cdot 10^{-6} \\ 3.6\cdot 10^{-5} & -1.2\cdot 10^{-6} & 4.1\cdot 10^{-6} & 3.0\cdot 10^{-4} & -2.5\cdot 10^{-6} & 7.4\cdot 10^{-5} \end{pmatrix} \right).$$

Under the model specification of Ha–Bauer [177], one can furthermore work out the discount factor and the annuity. Define for *t* ≥ 0 and *k >* 0 the affine term structure

$$F(t, k; r_t, m_{x+t}) = \exp\left\{ A(t, t+k) - B(t, t+k; \alpha)\, r_t - B(t, t+k; -\kappa)\, m_{x+t} \right\},$$

with deterministic functions

$$\begin{aligned} B(t, t+k; \alpha) &= \frac{1 - e^{-\alpha k}}{\alpha}, \\ A(t, t+k) &= \bar{\gamma}\left( B(t, t+k; \alpha) - k \right) + \frac{\sigma_r^2}{2\alpha^2}\left( k - 2B(t, t+k; \alpha) + B(t, t+k; 2\alpha) \right) \\ &\quad + \frac{\psi^2}{2\kappa^2}\left( k - 2B(t, t+k; -\kappa) + B(t, t+k; -2\kappa) \right) \\ &\quad + \frac{\varrho_{2,3}\, \sigma_r \psi}{\alpha\kappa}\left( B(t, t+k; -\kappa) - k + B(t, t+k; \alpha) - B(t, t+k; \alpha - \kappa) \right), \end{aligned}$$

with parameters for the short rate process $\alpha = 25\%$, $\sigma_r = 1\%$, for the force of mortality $\kappa = 7\%$, $\psi = 0.12\%$, the correlation between the short rate and the force of mortality $\varrho_{2,3} = -4\%$, and with market-price of risk-adjusted mean reversion level $\bar{\gamma} = 1.92\%$ of the short rate process. These formulas can be derived because we work under an affine Gaussian structure.

**Fig. 11.23** Marginal densities of the VA account value $e^{q_T}$ and the GMIB value $b\, a_{x+T}(r_T, m_{x+T})$

The discount factor is then given by

$$D(\tau, T; X_\tau) = F(\tau, T - \tau; r_\tau, m_{x+\tau}),$$

and the annuity is determined by (we cap at age 55 + 50 = 105)

$$a_{x+T}(r_T, m_{x+T}) = \sum_{k=1}^{50} F(T, k; r_T, m_{x+T}).$$

Moreover, we set the face value to $b = 10.79205$. This parametrization implies that the VA account value $e^{q_T}$ exceeds the GMIB $b\, a_{x+T}(r_T, m_{x+T})$ with a probability of roughly 40%, i.e., in roughly 60% of the cases we exercise the GMIB option. Figure 11.23 shows the marginal densities of these two variables; moreover, their correlation is close to 0.

The model is now fully specified so that we can estimate the conditional expectation in (11.57) as a function of $X_\tau$. We therefore simulate $n = 3\,000\,000$ i.i.d. Gaussian observations $(X_\tau^{(i)}, X_T^{(i)})$, $1 \le i \le n$, from (11.58). This provides us with the observations

$$\begin{aligned} Y_i &= D(\tau, T; X_\tau^{(i)})\; S(X_T^{(i)}) \\ &= F(\tau, T - \tau; r_\tau^{(i)}, m_{x+\tau}^{(i)})\; \max\left\{ e^{q_T^{(i)}},\; b \sum_{k=1}^{50} F(T, k; r_T^{(i)}, m_{x+T}^{(i)}) \right\}. \end{aligned}$$
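The affine term structure and the discounted payoff can be coded directly from the formulas above. A minimal numpy sketch with the stated parameters; function names are ours, and $A$, $B$ are written as functions of $k$ only since the formulas depend on $t$ only through $k$:

```python
import numpy as np

# Model parameters of the example (Ha-Bauer [177] parametrization).
alpha, sigma_r = 0.25, 0.01          # short rate
kappa, psi = 0.07, 0.0012            # force of mortality
rho_23, gamma_bar = -0.04, 0.0192    # correlation, risk-adjusted mean reversion
b, tau, T = 10.79205, 1.0, 15.0      # face value, horizon, maturity

def B(k, a):
    return (1.0 - np.exp(-a * k)) / a

def A(k):
    return (gamma_bar * (B(k, alpha) - k)
            + sigma_r**2 / (2 * alpha**2) * (k - 2 * B(k, alpha) + B(k, 2 * alpha))
            + psi**2 / (2 * kappa**2) * (k - 2 * B(k, -kappa) + B(k, -2 * kappa))
            + rho_23 * sigma_r * psi / (alpha * kappa)
              * (B(k, -kappa) - k + B(k, alpha) - B(k, alpha - kappa)))

def F(k, r, m):
    """Affine zero-coupon price F(t, k; r_t, m_{x+t})."""
    return np.exp(A(k) - B(k, alpha) * r - B(k, -kappa) * m)

def annuity(r_T, m_T):
    """Immediate annuity a_{x+T}, capped at age 105 (50 terms)."""
    return sum(F(k, r_T, m_T) for k in range(1, 51))

def simulate_Y(q_tau, r_tau, m_tau, q_T, r_T, m_T):
    """Discounted payoff Y = D(tau, T) * max{e^{q_T}, b * a_{x+T}}."""
    return F(T - tau, r_tau, m_tau) * max(np.exp(q_T), b * annuity(r_T, m_T))
```

Feeding the simulated components of $(X_\tau^{(i)}, X_T^{(i)})$ into `simulate_Y` yields the training responses $Y_i$.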

The resulting data $(Y_i, X_\tau^{(i)})_{1 \le i \le n}$ is used for determining the regression function $\mu(\cdot)$ in (11.57). We choose $n = 3\,000\,000$ samples in line with the least squares Monte Carlo approximation of Ha–Bauer [177].

We choose a FN network of depth $d = 3$ for approximating $\mu(\cdot)$. For the three FN layers we choose $(q_1, q_2, q_3) = (20, 15, 10)$ neurons with the hyperbolic tangent activation function, and as output activation we choose the identity function; we choose a more complex network compared to Cheridito et al. [74] because it seems that this gives us more accurate results. We fit this FN network using the square loss function, motivated by (11.55). Furthermore, we average over 20 runs with different seeds. Thus, we obtain 20 fitted FN networks $\widehat{\mu}_k(\cdot)$ for the 20 different seeds $1 \le k \le 20$, and the nagging predictor is obtained by averaging

$$\widehat{\mu}(\cdot) = \frac{1}{20} \sum_{k=1}^{20} \widehat{\mu}_k(\cdot).$$
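The variance-reducing effect of the nagging predictor can be illustrated on synthetic predictions; a sketch with hypothetical per-seed outputs (all numbers are ours, the noise mimics seed-to-seed fitting variability):

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 hypothetical fitted predictors evaluated on the same test points:
# row k holds mu_hat_k(x_1), ..., mu_hat_k(x_m) for one seed.
mu_true = np.linspace(0.0, 1.0, 500)
preds = mu_true + rng.normal(0.0, 0.1, size=(20, 500))  # seed-to-seed noise

nagging = preds.mean(axis=0)                        # average over the 20 seeds

mse_single = ((preds - mu_true) ** 2).mean(axis=1)  # per-seed test MSE
mse_nagging = ((nagging - mu_true) ** 2).mean()

# By convexity of the square loss, the nagging MSE never exceeds the average
# per-seed MSE; with independent noise it is roughly 20 times smaller.
```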

We then generate new i.i.d. samples $X_\tau^{(l)}$, $1 \le l \le L$, from the multivariate Gaussian distribution (11.58), where this time we only need the first 3 components. This gives us the empirical samples

$$\widehat{\mu}(X_\tau^{(l)}) \qquad \text{for } 1 \le l \le L,\tag{11.59}$$

providing an empirical distribution $\widehat{F}_{\widehat{\mu}(X_\tau)}$ that approximates the distribution of $\mu(X_\tau)$, given in (11.57). In risk management and solvency analysis, this empirical distribution can be used to estimate the Value-at-Risk (VaR) and the (upper) conditional tail expectation (CTE) of the valuation $\mu(X_\tau)$, seen from time 0, at different safety levels $p \in (0, 1)$,

$$\widehat{\mathrm{VaR}}_p = \widehat{F}_{\widehat{\mu}(X_\tau)}^{-1}(p) = \inf\left\{ y \in \mathbb{R};\; \widehat{F}_{\widehat{\mu}(X_\tau)}(y) \ge p \right\},$$

and

$$\widehat{\mathrm{CTE}}_p = \mathbb{E}_{\widehat{F}_{\widehat{\mu}(X_\tau)}}\left[ \widehat{\mu}(X_\tau)\, \Big|\, \widehat{\mu}(X_\tau) > \widehat{\mathrm{VaR}}_p \right].$$

We also refer to Sect. 11.3. The VaR and the CTE are two commonly used risk measures in insurance practice that determine the necessary risk bearing capital to run the corresponding insurance business. Typically, the VaR is evaluated at $p = 99.5\%$, i.e., we allow for a default probability of 0.5% of not being able to cover the changes in valuation over a $\tau = 1$ year time horizon. Alternatively, the CTE is considered at $p = 99\%$, which means that we need sufficient capital to cover on average the 1% worst changes in valuation over a 1 year time horizon.
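The two empirical risk measures can be coded directly from their definitions; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def var_cte(samples, p):
    """Empirical VaR_p = F^{-1}(p) and upper CTE_p of a sample."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    # smallest order statistic y with F_hat(y) >= p
    var_p = x[int(np.ceil(p * n)) - 1]
    tail = x[x > var_p]
    cte_p = tail.mean() if tail.size else var_p
    return var_p, cte_p

# Example: 1000 equally likely outcomes 1, 2, ..., 1000.
v, c = var_cte(np.arange(1, 1001), 0.995)
# v = 995.0 and c = mean(996, ..., 1000) = 998.0
```

In the VA example, `samples` would be the empirical values (11.59).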

Figure 11.24 shows our FN network approximations. The boxplots show the individual results of the estimates $\widehat{\mu}_k(\cdot)$ for the 20 different seeds, and the horizontal lines show the results of the nagging predictor (11.59). The red line at 140.97 gives the estimated VaR for $p = 99.5\%$; this value is slightly bigger than the best estimate of 139.47 (orange line) in Ha–Bauer [177], which is based on a functional approximation involving 37 monomials and 40'000'000 simulated samples. The CTEs at $p = 99.5\%$ and $p = 99\%$ are given by 145.09 and 141.49. We conclude that in the present example $\widehat{\mathrm{VaR}}_{99.5\%}$ (used in Europe) and $\widehat{\mathrm{CTE}}_{99\%}$ (used in Switzerland) are approximately of the same size for this VA with a GMIB.

This example shows how problems that require the computation of a conditional expectation can be solved. Alternatively, we could explore the LocalGLMnet architecture, which would allow us to explain the conditional expectation more explicitly in terms of the information $X_\tau$ available at time $\tau$. This may also be relevant in practice because it allows one to determine the main risk drivers of the underlying insurance business.

Figure 11.25 shows the marginal densities of the components of $X_\tau = (q_\tau, r_\tau, m_{x+\tau})$ in blue color. In red color we show the corresponding conditional densities of $X_\tau$, conditioned on $\widehat{\mu}(X_\tau) > \widehat{\mathrm{VaR}}_{99.5\%}$; thus, these are the feature values $X_\tau$ that lead to a shortfall beyond the 99.5% VaR of $\widehat{\mu}(X_\tau)$. From this figure we conclude that the main driver of the VaR is the VA account variable $q_\tau$, whereas the short rate $r_\tau$ and the force of mortality $m_{x+\tau}$ are slightly lower beyond the VaR compared to their unconditioned counterparts. The explanation for these smaller values is that they lead to less discounting and, hence, to bigger GMIB values. This is useful information for exploring importance sampling as mentioned in Remarks 11.17. This closes the example.

**Fig. 11.25** Feature values $X_\tau$ triggering VaR on the 99.5% level: (lhs) VA account log-value $q_\tau$, (middle) short rate $r_\tau$, and (rhs) force of mortality $m_{x+\tau}$; blue color shows the full density and red color shows the conditional density conditioned on being above the 99.5% VaR of $\widehat{\mu}(X_\tau)$

## *11.6.3 Bayesian Networks: An Outlook*

This section provides a short introduction to Bayesian networks and to variational inference. We see this section as a motivation for doing more research in that direction. In Sect. 11.4 we have assessed model uncertainty through bootstrapping. Alternatively, we could take a Bayesian viewpoint. We start from a fixed network architecture that involves a network parameter $\vartheta$. The Bayesian approach considered in Sect. 6.1 selects a prior density $\pi(\vartheta)$ on the space of network parameters (w.r.t. a measure $\nu$). For given data $(Y, \mathbf{x})$ we can then calculate the posterior density of $\vartheta$ by

$$\pi\left( \vartheta \,|\, Y, \mathbf{x} \right) \propto f\left( Y, \vartheta \,|\, \mathbf{x} \right) = f\left( Y \,|\, \vartheta, \mathbf{x} \right) \pi(\vartheta).\tag{11.60}$$

A new data point *Y* † with feature *x*† has conditional density, given observation *(Y, x)*,

$$f\left( y^\dagger \,\big|\, \mathbf{x}^\dagger; Y, \mathbf{x} \right) = \int_{\vartheta} f\left( y^\dagger \,\big|\, \vartheta, \mathbf{x}^\dagger \right) \pi\left( \vartheta \,|\, Y, \mathbf{x} \right) d\nu(\vartheta),$$

provided that $(Y, \mathbf{x})$ and $(Y^\dagger, \mathbf{x}^\dagger)$ are conditionally independent, given $\vartheta$. Thus, it only remains to determine the posterior density (11.60) of the network parameter $\vartheta$. Unfortunately, this is a rather challenging problem because of the curse of dimensionality, and even advanced MCMC methods, such as HMC, often do not lead to satisfactory results (convergence); for MCMC we refer to Sect. 6.1. For this reason one often explores approximate inference methods, see, e.g., Chapter 10 of Bishop [36] or the tutorial of Jospin et al. [205]. A scalable alternative is to approximate the posterior density using the so-called method of variational inference. This is presented in the following.
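For a toy model with a known posterior, the proportionality (11.60) can be verified numerically on a grid. A sketch of our own with a conjugate Gaussian model (one observation, no feature), where the exact posterior is available in closed form:

```python
import numpy as np

# Conjugate toy model: prior theta ~ N(0, 1), likelihood Y | theta ~ N(theta, 1);
# the exact posterior is N(Y / 2, 1 / 2).  We recover it numerically via the
# proportionality (11.60): posterior ∝ likelihood * prior.
Y = 1.3
grid = np.linspace(-6.0, 6.0, 20_001)
d = grid[1] - grid[0]

prior = np.exp(-grid**2 / 2)                     # unnormalized N(0, 1)
lik = np.exp(-(Y - grid)**2 / 2)                 # unnormalized N(theta, 1) at Y
post = prior * lik
post /= post.sum() * d                           # normalize f(Y|theta) pi(theta)

# Exact N(Y/2, 1/2) density for comparison.
post_exact = np.exp(-(grid - Y / 2)**2) / np.sqrt(np.pi)
```

For networks the grid approach is hopeless (the parameter is high-dimensional), which is exactly why variational inference is introduced below.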

Choose a family $\mathcal{F} = \{q(\cdot;\theta);\; \theta \in \Theta\}$ of (more tractable) densities that have the same support as the prior $\pi(\cdot)$, parametrized by $\theta \in \Theta \subset \mathbb{R}^K$. This family $\mathcal{F}$ is called the set of variational distributions, and the goal is to find the variational density $q(\cdot;\theta) \in \mathcal{F}$ that is closest to the posterior density (11.60).

To evaluate the similarity between two densities, we use the KL divergence, which measures the divergence from $\pi(\cdot \,|\, Y, \mathbf{x})$ to $q(\cdot;\theta)$, given by

$$D_{\mathrm{KL}}\left( q(\cdot;\theta)\, \Big\|\, \pi(\cdot \,|\, Y, \mathbf{x}) \right) = \int_{\vartheta} q(\vartheta;\theta)\, \log\left( \frac{q(\vartheta;\theta)}{\pi(\vartheta \,|\, Y, \mathbf{x})} \right) d\nu(\vartheta).$$

The optimal approximation within *F*, for given data *(Y, x)*, is found by solving

$$\widehat{\theta} = \widehat{\theta}(Y, \mathbf{x}) = \underset{\theta \in \Theta}{\arg\min}\; D_{\mathrm{KL}}\left( q(\cdot;\theta)\, \Big\|\, \pi(\cdot \,|\, Y, \mathbf{x}) \right);$$

for the moment we neglect existence and uniqueness questions. A main difficulty is the computation of this KL divergence because it involves the intractable posterior density of *ϑ*, given *(Y, x)*. We modify the optimization problem such that we can circumvent the explicit calculation of this KL divergence.

**Lemma 11.19** *We have the following identity*

$$\log f(Y \,|\, \mathbf{x}) = \mathcal{E}(\theta \,|\, Y, \mathbf{x}) + D_{\mathrm{KL}}\left( q(\cdot;\theta)\, \Big\|\, \pi(\cdot \,|\, Y, \mathbf{x}) \right),$$

*for the (unconditional) density $f(y \,|\, \mathbf{x}) = \int_{\vartheta} f(y \,|\, \vartheta, \mathbf{x})\, \pi(\vartheta)\, d\nu(\vartheta)$ and the so-called evidence lower bound (ELBO)*

$$\mathcal{E}(\theta \,|\, Y, \mathbf{x}) = \int_{\vartheta} q(\vartheta;\theta)\, \log\left( \frac{f(Y, \vartheta \,|\, \mathbf{x})}{q(\vartheta;\theta)} \right) d\nu(\vartheta).$$

Observe that the left-hand side in the statement of Lemma 11.19 is independent of $\theta \in \Theta$. Therefore, minimizing the KL divergence in $\theta$ is equivalent to maximizing the ELBO in $\theta$. This follows exactly the same philosophy as the EM algorithm, see (6.32); in fact, the ELBO $\mathcal{E}$ plays the role of the functional $Q$ defined in (6.33).

*Proof of Lemma 11.19* We start from the left-hand side of the statement

$$\begin{aligned} \log f(Y \,|\, \mathbf{x}) &= \int_{\vartheta} q(\vartheta;\theta) \log f(Y \,|\, \mathbf{x})\, d\nu(\vartheta) = \int_{\vartheta} q(\vartheta;\theta) \log\left( \frac{f(Y, \vartheta \,|\, \mathbf{x})}{\pi(\vartheta \,|\, Y, \mathbf{x})} \right) d\nu(\vartheta) \\ &= \int_{\vartheta} q(\vartheta;\theta) \log\left( \frac{f(Y, \vartheta \,|\, \mathbf{x})/q(\vartheta;\theta)}{\pi(\vartheta \,|\, Y, \mathbf{x})/q(\vartheta;\theta)} \right) d\nu(\vartheta) \\ &= \mathcal{E}(\theta \,|\, Y, \mathbf{x}) + D_{\mathrm{KL}}\left( q(\cdot;\theta)\, \Big\|\, \pi(\cdot \,|\, Y, \mathbf{x}) \right). \end{aligned}$$

This proves the claim.

The ELBO provides the lower bound (also called variational lower bound)

$$\log f(Y \,|\, \mathbf{x}) \ge \sup_{\theta \in \Theta} \mathcal{E}(\theta \,|\, Y, \mathbf{x}).$$

Interestingly, the ELBO does not include the posterior density, but only the joint density of *Y* and *ϑ*, given *x*, which is assumed to be known (available). It can be rewritten as

$$\begin{aligned} \mathcal{E}(\theta \,|\, Y, \mathbf{x}) &= \int_{\vartheta} q(\vartheta;\theta) \log f(Y, \vartheta \,|\, \mathbf{x})\, d\nu(\vartheta) - \int_{\vartheta} q(\vartheta;\theta) \log q(\vartheta;\theta)\, d\nu(\vartheta) \\ &= \mathbb{E}_{q(\cdot;\theta)}\Big[ \log f(Y, \vartheta \,|\, \mathbf{x})\, \Big|\, Y, \mathbf{x} \Big] - \mathbb{E}_{q(\cdot;\theta)}\Big[ \log q(\vartheta;\theta) \Big], \end{aligned}$$

the first term being the expected joint log-likelihood of *(Y, ϑ)* under the variational density *ϑ* ∼ *q(*·; *θ )*, and the second term being the entropy of the variational density.

The optimal approximation within *F* for given data *(Y, x)* is then found by solving

$$
\widehat{\theta} = \widehat{\theta}(Y, \mathbf{x}) \; = \operatorname\*{arg\,max}\_{\theta \in \Theta} \mathcal{E}(\theta | Y, \mathbf{x}).
$$

That is, we try to simultaneously maximize the expected joint log-likelihood of $(Y, \vartheta)$ and the entropy over all variational densities $q(\cdot;\theta)$ in $\mathcal{F}$.

If we have multiple observations $\mathcal{D} = \{(Y_i, \mathbf{x}_i);\; 1 \le i \le n\}$ that are conditionally i.i.d., given $\vartheta$, we have to solve (using conditional independence)

$$\begin{aligned} \widehat{\theta} &= \underset{\theta \in \Theta}{\arg\max}\; \mathcal{E}(\theta \,|\, \mathcal{D}) \\ &= \underset{\theta \in \Theta}{\arg\max}\; \mathbb{E}_{q(\cdot;\theta)}\left[ \log\left( \pi(\vartheta) \prod_{i=1}^{n} f\left( Y_i \,|\, \vartheta, \mathbf{x}_i \right) \right) \Big|\, \mathcal{D} \right] - \mathbb{E}_{q(\cdot;\theta)}\Big[ \log q(\vartheta;\theta) \Big] \\ &= \underset{\theta \in \Theta}{\arg\max}\; \left( \sum_{i=1}^{n} \mathbb{E}_{q(\cdot;\theta)}\Big[ \log f\left( Y_i \,|\, \vartheta, \mathbf{x}_i \right) \Big|\, Y_i, \mathbf{x}_i \Big] \right) - \mathbb{E}_{q(\cdot;\theta)}\left[ \log\left( \frac{q(\vartheta;\theta)}{\pi(\vartheta)} \right) \right] \\ &= \underset{\theta \in \Theta}{\arg\max}\; \left( \sum_{i=1}^{n} \mathbb{E}_{q(\cdot;\theta)}\Big[ \log f\left( Y_i \,|\, \vartheta, \mathbf{x}_i \right) \Big|\, Y_i, \mathbf{x}_i \Big] \right) - D_{\mathrm{KL}}\left( q(\cdot;\theta)\, \|\, \pi \right). \end{aligned}$$

Typically, one solves this problem with gradient ascent methods, which requires the calculation of the gradient $\nabla_\theta$ of the objective function on the right-hand side. This is more difficult than plain-vanilla gradient descent in network fitting because $\theta$ enters the expectation operator $\mathbb{E}_{q(\cdot;\theta)}$.

Kingma–Welling [217] propose the following reparametrization trick. Assume that we can obtain the random variable $\vartheta \sim q(\cdot;\theta)$ by a reparametrization $\vartheta \stackrel{(d)}{=} t(\epsilon, \theta)$ for some smooth function $t$, where $\epsilon \sim p$ does not depend on $\theta$. E.g., if $\vartheta$ is multivariate Gaussian with mean $\mu$ and covariance matrix $AA^\top$, then $\vartheta \stackrel{(d)}{=} \mu + A\epsilon$ for $\epsilon$ being standard multivariate Gaussian. Under the assumption that the reparametrization trick works for the family $\mathcal{F} = \{q(\cdot;\theta);\; \theta \in \Theta\}$ we arrive at, for $\epsilon \sim p$,

$$\begin{aligned} \widehat{\theta} &= \underset{\theta \in \Theta}{\arg\max}\; \mathcal{E}(\theta \,|\, \mathcal{D}) \\ &= \underset{\theta \in \Theta}{\arg\max}\; \sum_{i=1}^{n} \left( \mathbb{E}_{p}\Big[ \log f\left( Y_i \,\big|\, t(\epsilon, \theta), \mathbf{x}_i \right) \Big|\, Y_i, \mathbf{x}_i \Big] - \frac{1}{n}\, \mathbb{E}_{p}\left[ \log\left( \frac{q(t(\epsilon, \theta);\theta)}{\pi(t(\epsilon, \theta))} \right) \right] \right) \\ &= \underset{\theta \in \Theta}{\arg\max}\; \sum_{i=1}^{n} \mathbb{E}_{p}\left[ \log\left( \frac{f\left( Y_i \,\big|\, t(\epsilon, \theta), \mathbf{x}_i \right) \pi(t(\epsilon, \theta))^{1/n}}{q\left( t(\epsilon, \theta);\theta \right)^{1/n}} \right) \Big|\, Y_i, \mathbf{x}_i \right]. \end{aligned}$$

The gradient of the ELBO is then given by (provided we can exchange $\mathbb{E}_p$ and $\nabla_\theta$)

$$\nabla_\theta\, \mathcal{E}(\theta \,|\, \mathcal{D}) = \sum_{i=1}^{n} \mathbb{E}_{p}\left[ \nabla_\theta \log\left( \frac{f\left( Y_i \,\big|\, t(\epsilon, \theta), \mathbf{x}_i \right) \pi\left( t(\epsilon, \theta) \right)^{1/n}}{q\left( t(\epsilon, \theta);\theta \right)^{1/n}} \right) \Big|\, Y_i, \mathbf{x}_i \right].$$

These expected gradients are calculated empirically using Monte Carlo methods. Sample i.i.d. observations $\epsilon^{(i,j)} \sim p$, $1 \le i \le n$ and $1 \le j \le m$, and consider the empirical approximation

$$\nabla_\theta\, \mathcal{E}(\theta \,|\, \mathcal{D}) \approx \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} \nabla_\theta \log\left( \frac{f\left( Y_i \,\big|\, t(\epsilon^{(i,j)}, \theta), \mathbf{x}_i \right) \pi\left( t(\epsilon^{(i,j)}, \theta) \right)^{1/n}}{q\left( t(\epsilon^{(i,j)}, \theta);\theta \right)^{1/n}} \right). \tag{11.62}$$

Using this empirical approximation, we can apply gradient ascent methods to estimate $\theta$; this is known as the stochastic gradient variational Bayes (SGVB) estimator, see Sect. 2.4.3 of Kingma–Welling [217], or as Bayes by Backprop, see Blundell et al. [41] and Jospin et al. [205].
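The SGVB recursion can be made concrete on a toy model where the posterior is known. The following sketch is our own construction: a one-dimensional parameter, $m = 1$ Monte Carlo sample per step, analytic gradients, and a conjugate Gaussian model so that the exact posterior $\mathcal{N}(\sum_i Y_i/(n+1), 1/(n+1))$ is available for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Bayes by Backprop: data Y_i | theta ~ N(theta, 1), prior theta ~ N(0, 1),
# mean field variational family q(.; mu, sigma) = N(mu, sigma^2).
Y = rng.normal(1.0, 1.0, size=50)
n = len(Y)
post_mean = Y.sum() / (n + 1)                    # exact posterior mean

mu, sigma, lr = 0.0, 1.0, 5e-4
mu_trace, sigma_trace = [], []
for step in range(6000):
    eps = rng.normal()                           # m = 1 Monte Carlo sample
    theta = mu + sigma * eps                     # reparametrization trick
    # gradient of log f(Y | theta) + log pi(theta) w.r.t. theta
    g_theta = (Y - theta).sum() - theta
    # chain rule through t(eps, (mu, sigma)) = mu + sigma * eps; the entropy
    # term -log q(t(eps, .); .) = const + log sigma + eps^2/2 adds 1/sigma
    mu += lr * g_theta
    sigma = max(sigma + lr * (g_theta * eps + 1.0 / sigma), 1e-3)
    if step >= 2000:                             # average iterates after burn-in
        mu_trace.append(mu)
        sigma_trace.append(sigma)

mu_hat, sigma_hat = np.mean(mu_trace), np.mean(sigma_trace)
# mu_hat ~ post_mean and sigma_hat ~ 1 / sqrt(n + 1)
```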

*Example 11.20* We consider the gradient (11.62) for an example from the EDF. First, if $n$ is sufficiently large, it often suffices to set $m = 1$, and we still receive an accurate estimate. In that case we drop the index $j$, giving $\epsilon^{(i)}$. Assume that the (conditionally independent) observations $Y_i$ belong to the same member of the EDF having cumulant function $\kappa$. Moreover, assume that the (conditional) mean of $Y_i$, given $\mathbf{x}_i$, can be described by a FN network and a link function $g$ such that, see (7.8),

$$\mu_i = \mu(\mathbf{x}_i) = \mu_\vartheta(\mathbf{x}_i) = g^{-1}\left\langle \beta, z_w^{(d:1)}(\mathbf{x}_i) \right\rangle,$$

for network parameter $\vartheta = (\beta, w) \in \mathbb{R}^r$. In a Bayesian FN network this network parameter is not fixed but rather acts as a latent variable. In (11.62) this latent variable is, for realization $i$ and using the reparametrization trick, given by $\vartheta = t(\epsilon^{(i)};\theta) \in \mathbb{R}^r$; $\theta$ is not the canonical parameter here. Thus, we obtain the conditional mean of $Y_i$, given $\epsilon^{(i)}$ and $\mathbf{x}_i$,

$$\mu_i = \mu_{t(\epsilon^{(i)};\theta)}(\mathbf{x}_i) = g^{-1}\left\langle \beta(\epsilon^{(i)};\theta), z_{w(\epsilon^{(i)};\theta)}^{(d:1)}(\mathbf{x}_i) \right\rangle,$$

with network parameter $\vartheta(\epsilon^{(i)};\theta) = (\beta(\epsilon^{(i)};\theta), w(\epsilon^{(i)};\theta)) = t(\epsilon^{(i)}, \theta) \in \mathbb{R}^r$. Maximizing the ELBO implies that we need to calculate the gradients w.r.t. $\theta$. First, we calculate the gradient w.r.t. the network parameter $\vartheta$ of the data log-likelihood

$$\nabla_\vartheta \log f\left( Y_i \,|\, \vartheta, \mathbf{x}_i \right) = \nabla_\vartheta\, \ell_{Y_i}(\vartheta) \in \mathbb{R}^r.$$

This gradient is calculated with back-propagation; we refer to (7.16) and Proposition 7.5. There remains the chain rule for evaluating the inner derivative coming from the reparametrization trick $\Theta \subset \mathbb{R}^K \ni \theta \mapsto \vartheta = t(\epsilon^{(i)};\theta) \in \mathbb{R}^r$. Consider the Jacobian matrix

$$J(\theta; \epsilon^{(i)}) = \left( \frac{\partial}{\partial \theta_k}\, t_j(\epsilon^{(i)};\theta) \right)_{1 \le j \le r,\, 1 \le k \le K} \in \mathbb{R}^{r \times K}.$$

This gives us the gradient w.r.t. *θ*

$$\nabla_\theta \log f\left( Y_i \,\big|\, t(\epsilon^{(i)}, \theta), \mathbf{x}_i \right) = J(\theta; \epsilon^{(i)})^\top \left( \nabla_\vartheta\, \ell_{Y_i}(\vartheta) \Big|_{\vartheta = t(\epsilon^{(i)}, \theta)} \right). \tag{11.63}$$

The prior distribution is often taken to be multivariate Gaussian with prior mean $\tau \in \mathbb{R}^r$ and (symmetric and positive definite) prior covariance matrix $T \in \mathbb{R}^{r \times r}$, thus,

$$\pi(\vartheta) = \left( (2\pi)^{r/2} |T|^{1/2} \right)^{-1} \exp\left\{ -\frac{1}{2} (\vartheta - \tau)^\top T^{-1} (\vartheta - \tau) \right\}.$$

This implies for the gradient w.r.t. $\theta$ of the prior

$$\nabla_\theta \log \pi(t(\epsilon^{(i)}, \theta)) = -J(\theta; \epsilon^{(i)})^\top T^{-1}\left( t(\epsilon^{(i)}, \theta) - \tau \right) \in \mathbb{R}^K.$$

There remains the choice of the family $\mathcal{F} = \{q(\cdot;\theta);\; \theta \in \Theta\}$ of variational densities such that the reparametrization trick works. This is discussed in the remainder of this section.

We briefly discuss the most popular and simplest family chosen for the variational distributions $\mathcal{F}$. This is the so-called mean field Gaussian variational family, meaning that all components of $\vartheta \in \mathbb{R}^r$ are assumed to be independent Gaussian, that is,

$$q(\vartheta;\theta) = \prod_{j=1}^{r} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left\{ -\frac{1}{2\sigma_j^2} (\vartheta_j - \mu_j)^2 \right\},$$

for $\theta = (\mu_1, \sigma_1, \ldots, \mu_r, \sigma_r)^\top \in \mathbb{R}^K$ with $K = 2r$ and with $\sigma_j > 0$ for all $1 \le j \le r$. This allows us to apply the reparametrization trick

$$\vartheta \stackrel{(d)}{=} t(\epsilon, \theta) = \mu + \mathrm{diag}(\sigma_1, \ldots, \sigma_r)\, \epsilon = \begin{pmatrix} \mu_1 + \sigma_1 \epsilon_1 \\ \vdots \\ \mu_r + \sigma_r \epsilon_r \end{pmatrix},$$

with an $r$-dimensional standard Gaussian variable $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbb{1})$. The Jacobian matrix is

$$J(\theta; \epsilon) = \begin{pmatrix} 1 & \epsilon_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1 & \epsilon_2 & \cdots & 0 & 0 \\ \vdots & & & & \ddots & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & \epsilon_r \end{pmatrix} \in \mathbb{R}^{r \times K}.$$
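The block structure of $J(\theta;\epsilon)$ is easy to verify numerically; a sketch of our own with a finite-difference check of the partial derivatives $\partial t_j / \partial \theta_k$:

```python
import numpy as np

def t(eps, theta):
    """Reparametrization t(eps, theta) = mu + diag(sigma) eps for the
    interleaved parameter theta = (mu_1, sigma_1, ..., mu_r, sigma_r)."""
    mu, sigma = theta[0::2], theta[1::2]
    return mu + sigma * eps

rng = np.random.default_rng(0)
r = 3
theta = rng.uniform(0.5, 1.5, 2 * r)
eps = rng.normal(size=r)

# Central finite-difference Jacobian d t_j / d theta_k.
h = 1e-6
J = np.zeros((r, 2 * r))
for k in range(2 * r):
    dtheta = np.zeros(2 * r)
    dtheta[k] = h
    J[:, k] = (t(eps, theta + dtheta) - t(eps, theta - dtheta)) / (2 * h)

# Row j has a 1 in the mu_j column and eps_j in the sigma_j column,
# matching the displayed block structure of J(theta; eps).
```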

The mean field Gaussian case provides the entropy of the variational distribution

$$-\mathbb{E}_{q(\cdot;\theta)}\left[ \log q(\vartheta;\theta) \right] = \sum_{j=1}^{r} \left( \frac{1}{2} \log(2\pi\sigma_j^2) + \frac{1}{2} \right) = \sum_{j=1}^{r} \log\left( \sqrt{2\pi e}\, \sigma_j \right).$$
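The entropy formula can be checked against a Monte Carlo estimate; a numpy sketch (with $\mu = 0$ without loss of generality, since the entropy does not depend on the means):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.array([0.5, 1.0, 2.0])                # r = 3 mean field std devs

# Monte Carlo entropy -E[log q] of the mean field Gaussian q(.; theta).
theta_samp = rng.normal(0.0, sigma, size=(1_000_000, 3))
log_q = (-0.5 * np.log(2 * np.pi * sigma**2)
         - 0.5 * (theta_samp / sigma) ** 2).sum(axis=1)
entropy_mc = -log_q.mean()

# Closed form: sum_j log(sqrt(2 pi e) sigma_j).
entropy_formula = np.log(np.sqrt(2 * np.pi * np.e) * sigma).sum()
```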

This mean field Gaussian variational inference can be implemented with the R package tfprobability of Keydana et al. [212] and an explicit example is given in Kuo [230].

*Example 11.20, Revisited* Working under the assumptions of Example 11.20 and additionally assuming that the family of variational distributions $\mathcal{F}$ is multivariate Gaussian, $q(\cdot;\theta) \stackrel{(d)}{=} \mathcal{N}(\mu, \Sigma)$, leads us after some calculation to (the well-known formula)

$$D_{\mathrm{KL}}\left( q(\cdot;\theta)\, \big\|\, \pi \right) = \frac{1}{2}\left[ \log\left( \frac{|T|}{|\Sigma|} \right) - r + \mathrm{trace}\left( T^{-1}\Sigma \right) + (\tau - \mu)^\top T^{-1} (\tau - \mu) \right].$$
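This closed-form Gaussian KL divergence is easy to code and sanity-check; a sketch of our own (the diagonal choice below corresponds to the mean field case):

```python
import numpy as np

def kl_gauss(mu, Sigma, tau, T):
    """KL divergence of N(mu, Sigma) from N(tau, T) via the closed form."""
    r = len(mu)
    diff = tau - mu
    return 0.5 * (np.log(np.linalg.det(T) / np.linalg.det(Sigma)) - r
                  + np.trace(np.linalg.solve(T, Sigma))
                  + diff @ np.linalg.solve(T, diff))

# Diagonal (mean field) example: the KL decouples into univariate terms.
mu = np.array([0.2, -0.1]); Sigma = np.diag([0.04, 0.09])
tau = np.zeros(2);          T = np.eye(2)
kl = kl_gauss(mu, Sigma, tau, T)
```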

This further simplifies if $T$ and $\Sigma$ are diagonal, the latter being the mean field Gaussian case. The remaining terms of the ELBO are treated empirically as in (11.63).

This section has provided a short introduction to uncertainty estimation in networks using Bayesian methods. We believe that this gives a promising outlook, though it certainly needs more theoretical and practical work to become useful in practical applications.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 12 Appendix A: Technical Results on Networks**

The reader may have noticed that for GLMs we have developed an asymptotic theory that allowed us to assess the quality of predictors and to validate the fitted models. For networks such a theory does not yet exist, and the purpose of this appendix is to present more technical results on the asymptotic behavior of FN networks and their estimators that may lead to an asymptotic theory. We hope that this appendix stimulates further research in this field of statistical modeling.

## **12.1 Universality Theorems**

We present a specific version of the universality theorems for shallow FN networks; we refer to the discussion in Sect. 7.2.2. This section follows Hornik et al. [192]. Choose an input dimension $q_0 \in \mathbb{N}$ and consider the set of all affine functions

$$\mathcal{A}^{q_0} = \left\{ A : \{1\} \times \mathbb{R}^{q_0} \to \mathbb{R};\ \mathbf{x} \mapsto A(\mathbf{x}) = \langle w, \mathbf{x} \rangle,\ w \in \mathbb{R}^{q_0+1} \right\},$$

where we add a 0th component to the feature $\mathbf{x} = (x_0 = 1, x_1, \dots, x_{q_0})^\top \in \{1\} \times \mathbb{R}^{q_0}$ for the intercept. Choose a measurable (activation) function $\phi : \mathbb{R} \to \mathbb{R}$ and define

$$\Sigma^{q\_0}(\phi) = \left\{ f : \{1\} \times \mathbb{R}^{q\_0} \to \mathbb{R} ; \ \mathbf{x} \mapsto f(\mathbf{x}) = \sum\_{j=0}^{q\_1} \beta\_j \phi(A\_j(\mathbf{x})), \ A\_j \in \mathcal{A}^{q\_0}, \beta\_j \in \mathbb{R}, q\_1 \in \mathbb{N} \right\}.$$

This is the set of all shallow FN networks $f(\mathbf{x}) = \langle \beta, z^{(1:1)}(\mathbf{x}) \rangle$ with activation function $\phi$ and the linear output activation, see (7.8); the intercept component of the output is integrated into the 0th component $j = 0$. Moreover, we define the networks

$$\Sigma\Pi^{q_0}(\phi) = \left\{ f : \{1\} \times \mathbb{R}^{q_0} \to \mathbb{R};\ \mathbf{x} \mapsto f(\mathbf{x}) = \sum_{j=0}^{q_1} \beta_j \prod_{k=1}^{l_j} \phi(A_{j,k}(\mathbf{x})),\ A_{j,k} \in \mathcal{A}^{q_0},\ \beta_j \in \mathbb{R},\ l_j \in \mathbb{N},\ q_1 \in \mathbb{N} \right\}.$$

The latter networks contain the former, $\Sigma^{q_0}(\phi) \subset \Sigma\Pi^{q_0}(\phi)$, by setting $l_j = 1$ for all $0 \le j \le q_1$. We are going to prove a universality theorem first for the networks $\Sigma\Pi^{q_0}(\phi)$, and afterwards for the shallow FN networks $\Sigma^{q_0}(\phi)$.

**Definition 12.1** The function $\phi : \mathbb{R} \to [0, 1]$ is called a squashing function if it is non-decreasing with $\lim_{x \to -\infty} \phi(x) = 0$ and $\lim_{x \to \infty} \phi(x) = 1$.

Since squashing functions can have at most countably many discontinuities, they are measurable; a continuous and a non-continuous example are given by the sigmoid and by the step function activation, respectively, see Table 7.1.

**Lemma 12.2** *The sigmoid activation function is Lipschitz with constant* 1*/*4*.*

*Proof* The derivative of the sigmoid function is given by $\phi' = \phi(1 - \phi)$. This provides for the second derivative $\phi'' = \phi' - 2\phi\phi' = \phi'(1 - 2\phi)$. The latter is zero for $\phi(x) = 1/2$, i.e., for $x = 0$. This says that the maximal slope of $\phi$ is attained at $x = 0$ and it is $\phi'(0) = 1/4$.
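A quick numerical sanity check of Lemma 12.2 (a sketch, not part of the proof): the slope $\phi'(x) = \phi(x)(1-\phi(x))$ is maximal at $x = 0$ with value $1/4$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 100_001)
phi = sigmoid(x)
slope = phi * (1 - phi)        # phi'(x) = phi(x)(1 - phi(x))

max_slope = slope.max()
argmax_x = x[slope.argmax()]
print(max_slope, argmax_x)     # maximal slope 1/4, attained at x = 0
```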

We denote by $C(\mathbb{R}^{q_0})$ the set of all continuous functions from $\{1\} \times \mathbb{R}^{q_0}$ to $\mathbb{R}$, and by $M(\mathbb{R}^{q_0})$ the set of all measurable functions from $\{1\} \times \mathbb{R}^{q_0}$ to $\mathbb{R}$. If the measurable activation function $\phi$ is continuous, we have $\Sigma\Pi^{q_0}(\phi) \subset C(\mathbb{R}^{q_0})$, otherwise $\Sigma\Pi^{q_0}(\phi) \subset M(\mathbb{R}^{q_0})$.

**Definition 12.3** A subset $S \subset M(\mathbb{R}^{q_0})$ is said to be uniformly dense on compacta in $C(\mathbb{R}^{q_0})$ if for every compact subset $K \subset \{1\} \times \mathbb{R}^{q_0}$ the set $S$ is $\rho_K$-dense in $C(\mathbb{R}^{q_0})$, meaning that for all $\epsilon > 0$ and all $g \in C(\mathbb{R}^{q_0})$ there exists $f \in S$ such that

$$\rho\_K(\mathbf{g}, f) = \sup\_{\mathbf{x} \in K} |\mathbf{g}(\mathbf{x}) - f(\mathbf{x})| < \epsilon.$$

**Theorem 12.4 (Theorem 2.1 in Hornik et al. [192])** *Assume $\phi$ is a non-constant and continuous activation function. Then $\Sigma\Pi^{q_0}(\phi) \subset C(\mathbb{R}^{q_0})$ is uniformly dense on compacta in $C(\mathbb{R}^{q_0})$.*

*Proof* The proof is based on the Stone–Weierstrass theorem, which we briefly recall. Assume $\mathcal{A}$ is a family of real functions defined on a set $E$. $\mathcal{A}$ is called an *algebra* if it is closed under addition, multiplication and scalar multiplication. A family $\mathcal{A}$ *separates points* in $E$ if for every $x, z \in E$ with $x \neq z$ there exists a function $A \in \mathcal{A}$ with $A(x) \neq A(z)$. The family $\mathcal{A}$ does *not vanish at any point* of $E$ if for all $x \in E$ there exists a function $A \in \mathcal{A}$ such that $A(x) \neq 0$.

Let *A* be an algebra of continuous real functions on a compact set *K*. The Stone– Weierstrass theorem says that if *A* separates points in *K* and if it does not vanish at any point of *K*, then *A* is *ρK*-dense in the space of all continuous real functions on *K*.

Choose any compact set $K \subset \{1\} \times \mathbb{R}^{q_0}$. For any activation function $\phi$, $\Sigma\Pi^{q_0}(\phi)$ is obviously an algebra. So there remains to prove that this algebra separates points and does not vanish at any point. Firstly, choose $x, z \in K$ such that $x \neq z$. Since $\phi$ is non-constant we can choose $a, b \in \mathbb{R}$ such that $\phi(a) \neq \phi(b)$. Next choose $A \in \mathcal{A}^{q_0}$ such that $A(x) = a$ and $A(z) = b$. Then, $\phi(A(x)) \neq \phi(A(z))$ and $\Sigma\Pi^{q_0}(\phi)$ separates points. Secondly, since $\phi$ is non-constant, we can choose $a \in \mathbb{R}$ such that $\phi(a) \neq 0$. Moreover, choose weight $w = (a, 0, \dots, 0)^\top \in \mathbb{R}^{q_0+1}$. Then for this $A \in \mathcal{A}^{q_0}$, $A(x) = \langle w, x \rangle = a$ for any $x \in K$. Henceforth, $\phi(A(x)) \neq 0$, therefore $\Sigma\Pi^{q_0}(\phi)$ does not vanish at any point of $K$. The claim then follows from the Stone–Weierstrass theorem, using that $\phi$ is continuous by assumption.

For Theorem 12.4 to hold, the activation function $\phi$ can be any continuous and non-constant function, i.e., it does not need to be a squashing function. This is fairly general, but it rules out the step function activation as it is not continuous. However, for squashing functions continuity is not needed and one still receives the uniformly-dense-on-compacta property of $\Sigma\Pi^{q_0}(\phi)$ in $C(\mathbb{R}^{q_0})$; this has been proved in Theorem 2.3 of Hornik et al. [192]. The following theorem also does not need continuity, i.e., we do not require $\Sigma^{q_0}(\phi) \subset C(\mathbb{R}^{q_0})$ as $\phi$ only needs to be measurable (and squashing).

**Theorem 12.5 (Universality, Theorem 2.4 in Hornik et al. [192])** *Assume $\phi$ is a squashing activation function. $\Sigma^{q_0}(\phi)$ is uniformly dense on compacta in $C(\mathbb{R}^{q_0})$.*

*Sketch of Proof* For the (continuous) cosine activation function choice $\cos(\cdot)$, Theorem 12.4 applies to $\Sigma\Pi^{q_0}(\cos)$. Repeatedly applying the trigonometric identity $\cos(a)\cos(b) = \frac{1}{2}\left(\cos(a+b) + \cos(a-b)\right)$ allows us to rewrite any trigonometric polynomial $\prod_{k=1}^{l_j} \cos(A_{j,k}(\mathbf{x}))$ as $\sum_{t=1}^{T} \alpha_t \cos(A_t(\mathbf{x}))$ for suitable $A_t \in \mathcal{A}^{q_0}$, $\alpha_t \in \mathbb{R}$ and $T \in \mathbb{N}$. This allows us to identify $\Sigma^{q_0}(\cos) = \Sigma\Pi^{q_0}(\cos)$. As a consequence of Theorem 12.4, shallow FN networks $\Sigma^{q_0}(\cos)$ are uniformly dense on compacta in $C(\mathbb{R}^{q_0})$.

The remaining part relies on approximating the cosine activation function. Firstly, Lemma A.2 of Hornik et al. [192] says that for any continuous squashing function $\psi$ and any $\epsilon > 0$ there exists $H_\epsilon(x) = \sum_{j=1}^{q_1} \beta_j \phi(w_0^{(j)} + w_1^{(j)} x) \in \Sigma^1(\phi)$, $x \in \mathbb{R}$, such that

$$\sup\_{\mathbf{x}\in\mathbb{R}}|\psi(\mathbf{x}) - H\_{\epsilon}(\mathbf{x})| < \epsilon. \tag{12.1}$$

For the proof we refer to Lemma A.2 of Hornik et al. [192]; it uses that $\psi$ is a continuous squashing function, implying that for every $\delta \in (0, 1)$ there exists $m > 0$ such that $\psi(-m) < \delta$ and $\psi(m) > 1 - \delta$. The approximation $H_\epsilon \in \Sigma^1(\phi)$ of $\psi$ is then constructed on $(-m, m)$ so that the error bound holds (for $\delta$ sufficiently small).

Secondly, choose $\epsilon > 0$ and $M > 0$; then there exists $\cos_{M,\epsilon} \in \Sigma^1(\phi)$ such that

$$\sup_{x \in [-M,M]} \left| \cos(x) - \cos_{M,\epsilon}(x) \right| < \epsilon. \tag{12.2}$$

This is Lemma A.3 of Hornik et al. [192]; to prove this, we consider the cosine squasher of Gallant–White [150], for $x \in \mathbb{R}$

$$\chi(\boldsymbol{x}) = \frac{1}{2} \left( 1 + \cos \left( \boldsymbol{x} + \frac{3\pi}{2} \right) \right) \mathbb{1}\_{\{-\pi/2 \le \boldsymbol{x} \le \pi/2\}} + \mathbb{1}\_{\{\boldsymbol{x} > \pi/2\}} \in [0, 1].$$

This is a continuous squashing function. Adding, subtracting and scaling a *finite* number of affinely shifted versions of the cosine squasher $\chi$ can exactly replicate the cosine on $[-M, M]$. Claim (12.2) then follows from the fact that we need a finite number of cosine squashers $\chi$ to replicate the cosine on $[-M, M]$, the triangle inequality, and the fact that the (continuous) cosine squasher can be approximated arbitrarily well in $\Sigma^1(\phi)$ using (12.1).
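The cosine squasher and the replication idea can be checked numerically. In the following sketch, the particular shift $3\pi/2$ and the interval $[\pi, 2\pi]$ are chosen for illustration; on this interval a single affinely shifted and scaled copy of $\chi$ already replicates the cosine exactly.

```python
import numpy as np

def cosine_squasher(x):
    """Gallant-White cosine squasher chi(x)."""
    x = np.asarray(x, dtype=float)
    ramp = 0.5 * (1.0 + np.cos(x + 3 * np.pi / 2)) * ((x >= -np.pi / 2) & (x <= np.pi / 2))
    return ramp + 1.0 * (x > np.pi / 2)

# chi is a squashing function: non-decreasing, 0 at -infinity, 1 at +infinity.
grid = np.linspace(-10, 10, 10_001)
vals = cosine_squasher(grid)

# One affinely shifted and scaled copy replicates the cosine on [pi, 2*pi]:
# 2 * chi(x - 3*pi/2) - 1 = cos(x) for x in [pi, 2*pi].
x = np.linspace(np.pi, 2 * np.pi, 1_001)
replica = 2 * cosine_squasher(x - 3 * np.pi / 2) - 1
print(np.max(np.abs(replica - np.cos(x))))  # zero up to floating point
```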

The final step is to patch everything together. Consider $\sum_{t=1}^{T} \alpha_t \cos(A_t(\mathbf{x}))$ which approximates on the compact set $K \subset \{1\} \times \mathbb{R}^{q_0}$ a given continuous function $g \in C(\mathbb{R}^{q_0})$ with a given tolerance $\epsilon/2$. Choose $M > 0$ such that $A_t(K) \subset [-M, M]$ for all $1 \le t \le T$. Note that this $M$ can be found because $K$ is compact, the $A_t$ are continuous and $T$ is finite. Define $T' = \sum_{t=1}^{T} |\alpha_t| < \infty$. By (12.2) we can then choose $\cos_{M, \epsilon/(2T')} \in \Sigma^1(\phi)$ such that

$$\sup_{\mathbf{x}\in K} \left| \sum_{t=1}^{T} \alpha_t \cos(A_t(\mathbf{x})) - \sum_{t=1}^{T} \alpha_t \cos_{M,\epsilon/(2T')}(A_t(\mathbf{x})) \right| < \epsilon/2.$$

This completes the proof.
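The universality statement of Theorem 12.5 can also be illustrated numerically. The following sketch does not follow the constructive cosine proof; instead, it fits shallow sigmoid networks of growing width $q_1$ to a continuous target on $[0,1]$ by drawing the affine maps $A_j$ at random and optimizing only the output weights $\beta_j$ by least squares (a simplification chosen here so that the fit is a convex problem; the target function and all weight ranges are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Target: a continuous function g on the compact set [0, 1].
g = lambda x: np.sin(4 * np.pi * x) + 0.5 * x**2

x = np.linspace(0.0, 1.0, 400)

def fit_shallow(q1):
    """Fit f(x) = sum_j beta_j * sigmoid(w0_j + w1_j * x) to g on the grid.

    The hidden affine maps A_j are drawn at random; only the output layer
    beta is optimized (by least squares).
    """
    w0 = rng.uniform(-20, 20, q1)
    w1 = rng.uniform(-20, 20, q1)
    Z = sigmoid(w0 + np.outer(x, w1))       # hidden activations, shape (400, q1)
    beta, *_ = np.linalg.lstsq(Z, g(x), rcond=None)
    return np.max(np.abs(Z @ beta - g(x)))  # sup-norm error rho_K(g, f) over the grid

errors = {q1: fit_shallow(q1) for q1 in (5, 50, 500)}
print(errors)  # the sup-norm error decreases as the width q1 grows
```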

## **12.2 Consistency and Asymptotic Normality**

Universality Theorem 12.5 tells us that we can approximate any compactly supported continuous function arbitrarily well by a sufficiently large shallow FN network, say, with sigmoid activation function $\phi$. The next natural question is whether we can *learn* these approximations from data $(Y_i, \mathbf{x}_i)_{i \ge 1}$ that follow the true but unknown regression function $\mathbf{x} \mapsto \mu_0(\mathbf{x})$, or in other words, whether we have consistency for a certain class of learning methods. This is the question addressed, e.g., in White [379, 380], Barron [26], Chen–Shen [73], Döhler–Rüschendorf [109] and Shen et al. [336]. This turns the algebraic universality question into a statistical question about consistency.

Assume that the true data model satisfies

$$Y = \mu\_0(\mathbf{x}) + \varepsilon = \mathbb{E}[Y|\mathbf{x}] + \varepsilon,\tag{12.3}$$

for a continuous regression function $\mu_0 : \mathcal{X} \to \mathbb{R}$ on a compact set $\mathcal{X} \subset \{1\} \times \mathbb{R}^{q_0}$, and with a centered error $\varepsilon$ satisfying $\mathbb{E}[|\varepsilon|^{2+\delta}] < \infty$ for some $\delta > 0$ and being independent of $\mathbf{x}$. The question now is whether we can learn this (true) regression function $\mu_0$ from independent data $(Y_i, \mathbf{x}_i)$, $1 \le i \le n$, obeying (12.3). Throughout this section we use the square error loss function $L(y, a) = (y - a)^2$. For given data, this results in solving

$$\widetilde{\mu}_n = \operatorname*{arg\,min}_{\mu \in C(\mathcal{X})} \frac{1}{n} \sum_{i=1}^n L\left(Y_i, \mu(\mathbf{x}_i)\right) = \operatorname*{arg\,min}_{\mu \in C(\mathcal{X})} \frac{1}{n} \sum_{i=1}^n \left(Y_i - \mu(\mathbf{x}_i)\right)^2, \qquad(12.4)$$

where $C(\mathcal{X})$ denotes the set of continuous functions on the compact set $\mathcal{X} \subset \{1\} \times \mathbb{R}^{q_0}$. The main question is whether the estimator $\widetilde{\mu}_n$ approaches the true regression function $\mu_0$ for increasing sample size $n$.

Typically, the family of continuous functions $C(\mathcal{X})$ is much too rich to be able to solve optimization problem (12.4), and the solution may have undesired properties. In particular, the solution to (12.4) will over-fit to the data for any sample size $n$, and consistency will not hold, see, e.g., Section 2.2.1 in Chen [72]. Therefore, the optimization needs to be done over (well-chosen) smaller sets $\mathcal{S}_n \subset C(\mathcal{X})$. For instance, $\mathcal{S}_n$ can be the set of shallow FN networks having a maximal width $q_1 = q_1(n)$, depending on the sample size $n$ of the data. Considering this regression problem in a non-parametric sense, we let these sets $\mathcal{S}_n$ grow with the sample size $n$. This idea is attributed to Grenander [172] and it is called the *method of sieve estimators* of $\mu_0$. We define for $d \in \mathbb{N}$, $\Delta > 0$, $\widetilde{\Delta} > 0$ and activation function $\phi$

$$\mathcal{S}(d,\Delta,\widetilde{\Delta},\phi) = \left\{ f \in \Sigma^{q\_0}(\phi) \colon q\_1 = d, \ \sum\_{j=0}^{q\_1} |\beta\_j| \le \Delta, \ \max\_{1 \le j \le q\_1} \sum\_{l=0}^{q\_0} |w\_{l,j}| \le \widetilde{\Delta} \right\}.$$

These sets $\mathcal{S}(d, \Delta, \widetilde{\Delta}, \phi)$ are shallow FN networks of a given width $q_1 = d$ and with some restrictions on the network parameters.<sup>1</sup> We then choose increasing sequences

<sup>1</sup> The bound $\sum_{j=0}^{q_1} |\beta_j| \le \Delta$ in $\mathcal{S}(d, \Delta, \widetilde{\Delta}, \phi)$ allows us to view this set of shallow FN networks as a symmetric convex hull of the family of functions $\mathcal{S}_0(\phi) = \{\mathbf{x} \mapsto \phi(A(\mathbf{x}));\ A \in \mathcal{A}^{q_0}\}$, see Sect. 2.6.3 in Van der Vaart–Wellner [364]. If we choose an increasing activation function $\phi$, this family of functions $\phi \circ A$ is a composition of a fixed increasing function $\phi$ and a finite-dimensional vector space $\mathcal{A}^{q_0}$ of functions $A$. This implies that $\mathcal{S}_0(\phi)$ is a VC-class, saying that it has a finite Vapnik–Chervonenkis (VC) dimension [365]; see also Condition A and Theorem 2.1 in Döhler–Rüschendorf [109]. This VC-class property is important in many proofs as it leads to a finite covering (metric entropy) of function spaces, and this allows one to apply limit theorems to point processes; we refer to Van der Vaart–Wellner [364].

$(d_n)_{n\ge1}$, $(\Delta_n)_{n\ge1}$ and $(\widetilde{\Delta}_n)_{n\ge1}$, which provides us with an increasing sequence of sieves (becoming finer as $n$ increases)

$$\ldots \;\subseteq\; \mathcal{S}_n(\phi) \stackrel{\text{def.}}{=} \mathcal{S}(d_n, \Delta_n, \widetilde{\Delta}_n, \phi) \;\subseteq\; \mathcal{S}_{n+1}(\phi) \stackrel{\text{def.}}{=} \mathcal{S}(d_{n+1}, \Delta_{n+1}, \widetilde{\Delta}_{n+1}, \phi) \;\subseteq\; \ldots$$

The following corollary is a simple consequence of Theorem 12.5.

**Corollary 12.6** *Assume $\phi$ is a squashing activation function, and let the increasing sequences $(d_n)_{n\ge1}$, $(\Delta_n)_{n\ge1}$ and $(\widetilde{\Delta}_n)_{n\ge1}$ tend to infinity for $n \to \infty$. Then $\bigcup_{n\ge1} \mathcal{S}_n(\phi)$ is uniformly dense in $C(\mathcal{X})$.*

This corollary says that for any regression function $\mu_0 \in C(\mathcal{X})$ we can find $n \in \mathbb{N}$ and $\mu_n \in \mathcal{S}_n(\phi)$ such that $\mu_n$ is arbitrarily close to $\mu_0$; remark that all functions are continuous on the compact set $\mathcal{X}$, and uniformly dense means $\rho_{\mathcal{X}}$-dense in that case. Corollary 12.6 does not hold true if $\Delta_n \equiv \Delta > 0$ for all $n$. In that case we can only approximate the smaller function class $\bigcup_{n\ge1} \mathcal{S}_n(\phi) \subset C(\mathcal{X})$. This is going to be used in one of the cases below.

For increasing sequences $(d_n)_{n\ge1}$, $(\Delta_n)_{n\ge1}$ and $(\widetilde{\Delta}_n)_{n\ge1}$ we define the sieve estimator $(\widehat{\mu}_n)_{n\ge1}$ by

$$\widehat{\mu}\_n = \underset{\mu \in \mathcal{S}\_n(\phi)}{\text{arg min}} \frac{1}{n} \sum\_{i=1}^n L\left(Y\_i, \mu(\mathbf{x}\_i)\right). \tag{12.5}$$
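The sieve estimator (12.5) can be illustrated by simulation. The following sketch is a strong simplification: the hidden weights are drawn at random and only the output weights are optimized by least squares, the $\Delta_n$- and $\widetilde{\Delta}_n$-constraints are ignored, and the width grows like $d_n \sim n^{1/3}$ (an arbitrary choice); still, the error measured in the empirical pseudo-norm introduced below decreases with the sample size $n$.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

mu0 = lambda x: np.sin(2 * np.pi * x)   # true regression function on [0, 1]
sigma_eps = 0.5                         # noise standard deviation

def sieve_fit(n):
    """Sieve estimator sketch: the width d_n grows with the sample size n.

    Simplification: hidden weights are random and only the output layer is
    optimized by least squares; the Delta-constraints of S_n(phi) are ignored.
    """
    x = rng.uniform(0.0, 1.0, n)
    y = mu0(x) + sigma_eps * rng.standard_normal(n)
    d_n = max(3, int(round(n ** (1 / 3))))
    w0 = rng.uniform(-10, 10, d_n)
    w1 = rng.uniform(-10, 10, d_n)
    Z = sigmoid(w0 + np.outer(x, w1))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    # Empirical pseudo-norm ||mu_hat - mu_0||_n on the observed features.
    return np.sqrt(np.mean((Z @ beta - mu0(x)) ** 2))

errs = {n: sieve_fit(n) for n in (50, 20_000)}
print(errs)  # the pseudo-norm error shrinks as n grows
```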

Under the following assumptions one can prove a consistency theorem.

**Assumption 12.7** *Choose a complete probability space $(\Omega, \mathcal{A}, \mathbb{P})$*<sup>2</sup> *and $\mathcal{X} = \{1\} \times [0,1]^{q_0}$.*


Most results that we are going to present below hold for activation functions that are Lipschitz. The sigmoid activation function is Lipschitz, see Lemma 12.2.

The following considerations are based on the pseudo-norm, given $(X_i)_{1 \le i \le n}$,

$$\|\mu\|\_{n} = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} \left(\mu(X\_{i})\right)^{2}} \qquad \text{for } \mu \in \mathcal{C}(\mathcal{X}).$$

<sup>2</sup> A probability space $(\Omega, \mathcal{A}, \mathbb{P})$ is complete if for any $\mathbb{P}$-null set $B \in \mathcal{A}$ with $\mathbb{P}[B] = 0$ and every subset $A \subset B$ it follows that $A \in \mathcal{A}$.

This is a pseudo-norm because it is positive, $\|\mu\|_n \ge 0$, absolutely homogeneous, $\|a\mu\|_n = |a|\,\|\mu\|_n$, and the triangle inequality holds; but it is not definite because $\|\mu\|_n = 0$ does not imply that $\mu$ is the zero function (i.e., it is not point-separating). This pseudo-norm $\|\cdot\|_n$ depends on the (random) features $(X_i)_{1 \le i \le n}$ and, therefore, the subsequent statements involving this pseudo-norm hold in probability. The following result provides consistency, i.e., that the true regression function $\mu_0$, indeed, can be learned from i.i.d. data.

**Theorem 12.8 (Consistency, Theorem 3.1 of Shen et al. [336])** *Under Assumption 12.7, the sieve estimator $(\widehat{\mu}_n)_{n\ge1}$ in* (12.5) *exists. We have consistency $\|\widehat{\mu}_n - \mu_0\|_n \to 0$ in probability as $n \to \infty$, i.e., for all $\epsilon > 0$*

$$\lim\_{n \to \infty} \mathbb{P}\left[ \| \widehat{\mu}\_n - \mu\_0 \|\_n > \epsilon \right] = 0.$$

*Remarks 12.9*


*Sketch of Proof of Theorem 12.8* The proof of this theorem is based on a theorem in White–Woolridge [381] which states the following: if we have a sequence $(\mathcal{S}_n(\phi))_{n\ge1}$ of compact subsets of $C(\mathcal{X})$, and if $L_n : \Omega \times \mathcal{S}_n(\phi) \to \mathbb{R}$ is an $\mathcal{A} \otimes \mathcal{B}(\mathcal{S}_n(\phi))/\mathcal{B}(\mathbb{R})$-measurable sequence, $n \ge 1$, with $L_n(\omega, \cdot)$ being lower-semicontinuous on $\mathcal{S}_n(\phi)$ for all $\omega \in \Omega$, then there exists $\widehat{\mu}_n : \Omega \to \mathcal{S}_n(\phi)$ being $\mathcal{A}/\mathcal{B}(\mathcal{S}_n(\phi))$-measurable such that for each $\omega \in \Omega$, $L_n(\omega, \widehat{\mu}_n(\omega)) = \min_{\mu \in \mathcal{S}_n(\phi)} L_n(\omega, \mu)$. For the proof of the compactness of $\mathcal{S}_n(\phi)$ in $C(\mathcal{X})$ we need that $d_n$ and $\Delta_n$ are finite for any $n$. This then provides the existence of the sieve estimator; for details we refer to Lemma 2.1 and Corollary 2.1 in Shen et al. [336]. The proof of the consistency result then uses the growth rates on $(d_n)_{n\ge1}$ and $(\Delta_n)_{n\ge1}$; for the details of the proof we refer to Theorem 3.1 in Shen et al. [336].

The next step is to analyze the rates of convergence of the sieve estimator $\widehat{\mu}_n \to \mu_0$, as $n \to \infty$. These rates heavily depend on (additional) regularity assumptions on the true regression function $\mu_0 \in C(\mathcal{X})$; we refer to Remark 3 in Sect. 5 of Chen–Shen [73]. Here, we present some results of Shen et al. [336]. From the proof of Theorem 12.8 we know that $\mathcal{S}_n(\phi)$ is a compact set in $C(\mathcal{X})$. This motivates to consider the closest approximation $\pi_n\mu \in \mathcal{S}_n(\phi)$ to $\mu \in C(\mathcal{X})$. The uniform denseness of $\bigcup_{n\ge1} \mathcal{S}_n(\phi)$ in $C(\mathcal{X})$ implies that $\pi_n\mu$ converges to $\mu$. The aforementioned rates of convergence of the sieve estimators will depend on how fast $\pi_n\mu_0 \in \mathcal{S}_n(\phi)$ converges to the true regression function $\mu_0 \in C(\mathcal{X})$.

If one cannot determine the global minimum of (12.5), then often an accurate approximation is sufficient. For this one introduces an approximate sieve estimator. A sequence $(\widehat{\mu}_n)_{n\ge1}$ is called an *approximate sieve estimator* if

$$\frac{1}{n}\sum\_{l=1}^{n}(Y\_{l}-\widehat{\mu}\_{n}(X\_{l}))^{2} \leq \inf\_{\mu \in \mathcal{S}\_{n}(\phi)} \frac{1}{n} \sum\_{l=1}^{n}(Y\_{l}-\mu(X\_{l}))^{2} + O\_{P}(\eta\_{n}),\tag{12.6}$$

where $(\eta_n)_{n\ge1}$ is a positive sequence converging to 0 as $n \to \infty$. The last term $O_P(\eta_n)$ denotes stochastic boundedness, meaning that for all $\epsilon > 0$ there exists $K_\epsilon > 0$ such that for all $n \ge 1$

$$\mathbb{P}\left[\frac{1}{n}\sum\_{l=1}^{n}(Y\_{l}-\widehat{\mu}\_{n}(X\_{l}))^{2}-\inf\_{\mu\in\mathcal{S}\_{n}(\phi)}\frac{1}{n}\sum\_{l=1}^{n}(Y\_{l}-\mu(X\_{l}))^{2}>K\_{\epsilon}\eta\_{n}\right]<\epsilon.$$

**Theorem 12.10 (Theorem 4.1 of Shen et al. [336], Without Proof)** *Set Assumption 12.7. If*

$$\eta\_n = O\left(\min\left\{\left\|\pi\_n\mu\_0 - \mu\_0\right\|\_n^2, \frac{d\_n\log(d\_n\Delta\_n)}{n}, \frac{d\_n\log n}{n}\right\}\right),$$

*the following stochastic boundedness holds for n* ≥ 1

$$\|\widehat{\mu}_n - \mu_0\|_n = O_P\left(\max\left\{\|\pi_n\mu_0 - \mu_0\|_n, \sqrt{\frac{d_n \log n}{n}}\right\}\right).$$

*Remarks 12.11*

• Assumption 12.7 implies that $d_n \log(d_n \Delta_n n) = o(n)$ as $n \to \infty$. Therefore, $\eta_n \to 0$ as $n \to \infty$.


$$\|\widehat{\mu}\_n - \mu\_0\|\_n = O\_P(r\_n^{-1}),$$

for

$$r\_n = \left(\frac{n}{\log n}\right)^{(q\_0+1)/(4q\_0+2)} \qquad n \ge 2. \tag{12.7}$$

Note that $1/4 \le (q_0 + 1)/(4q_0 + 2) \le 1/2$. Thus, this is a slower rate than the square-root rate of typical asymptotic normality; for instance, for $q_0 = 1$ we get $1/3$. Interestingly, Barron [26] proposes the choice $d_n \sim (n/\log n)^{1/2}$ to receive an approximation rate of $(n/\log n)^{-1/4}$.

Also note that the space $\mathcal{F}(\mathcal{X})$ allows us to choose a finite $\Delta_n \equiv \Delta > 0$ in the sieves; thus, here we do not receive denseness of the sieves in the space of continuous functions $C(\mathcal{X})$, but only in the space $\mathcal{F}(\mathcal{X})$ of functions with finite first absolute moments of the Fourier magnitude distributions.

The last step is to establish the asymptotic normality. For this we have to define perturbations of shallow FN networks *μ* ∈ *Sn(φ)*. Choose *ηn* ∈ *(*0*,* 1*)* and define the function

$$
\widetilde{\mu}\_n(\mu) = (1 - \eta\_n^{1/2})\mu + \eta\_n^{1/2}(\mu\_0 + 1).
$$

This allows us to state the following asymptotic normality result.

**Theorem 12.12 (Theorem 5.1 of Shen et al. [336], Without Proof)** *Set Assumption 12.7. We make the following additional assumptions: suppose $\eta_n = o(n^{-1})$ and choose $\varrho_n$ such that we have stochastic boundedness $\varrho_n \|\widehat{\mu}_n - \mu_0\|_n = O_P(1)$. Let the following conditions hold:*

$$\begin{array}{l} \text{(C1)}\ d_n \Delta_n \log(d_n \Delta_n) = o(n^{1/4}); \\ \text{(C2)}\ n \varrho_n^{-2}/\Delta_n^{\delta} = o(1); \end{array}$$

$$\text{(C3)}\ \sup_{\mu \in \mathcal{S}_n(\phi):\, \|\mu - \mu_0\|_n \le \varrho_n^{-1}} \|\pi_n\widetilde{\mu}_n(\mu) - \widetilde{\mu}_n(\mu)\|_n = O_P(\varrho_n \eta_n);$$

$$\text{(C4)}\ \sup_{\mu \in \mathcal{S}_n(\phi):\, \|\mu - \mu_0\|_n \le \varrho_n^{-1}}\ \frac{1}{n} \sum_{i=1}^n \varepsilon_i \left(\pi_n \widetilde{\mu}_n(\mu)(X_i) - \widetilde{\mu}_n(\mu)(X_i)\right) = O_P(\eta_n).$$

*We have the following asymptotic normality for n* → ∞

$$\frac{1}{\sqrt{n}}\sum\_{i=1}^n \left(\widehat{\mu}\_n(X\_i) - \mu\_0(X\_i)\right) \Rightarrow \mathcal{N}\left(0, \sigma^2\right).$$

The assumptions of Theorem 12.12 require a slower growth rate $d_n$ of the shallow FN network compared to the consistency results. Shen et al. [336] bring forward the argument that for the asymptotic normality result to hold, the shallow FN network should grow more slowly in order to get the Gaussian property, otherwise the sieve estimator may be skewed towards the true function $\mu_0$. Conditions (C3)–(C4), on the other side, give lower growth rates on the networks such that the approximation error decreases sufficiently fast.

If the variance parameter $\sigma^2 = \mathrm{Var}(\varepsilon_i)$ is not known, we can estimate it empirically by

$$
\widehat{\sigma}\_n^2 = \frac{1}{n} \sum\_{i=1}^n \left( Y\_i - \widehat{\mu}\_n(X\_i) \right)^2.
$$

Theorem 5.2 in Shen et al. [336] proves that this estimator is consistent for $\sigma^2$, and that the asymptotic normality result also holds true under this estimated variance parameter (using Slutsky's theorem), under the same assumptions as in Theorem 12.12.
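As a minimal illustration of the variance estimator, the following simulation plugs in the true regression function $\mu_0$ as an oracle stand-in for the fitted $\widehat{\mu}_n$ (a simplification made purely for illustration; $\mu_0$, $\sigma$ and the feature distribution are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(3)

mu0 = lambda x: np.sin(2 * np.pi * x)   # true regression function on [0, 1]
sigma = 0.5                             # true noise standard deviation

n = 100_000
x = rng.uniform(0.0, 1.0, n)
y = mu0(x) + sigma * rng.standard_normal(n)

# Empirical variance estimator; for illustration we plug in the true mu_0
# as an oracle stand-in for the fitted mu_hat_n.
sigma2_hat = np.mean((y - mu0(x)) ** 2)
print(sigma2_hat)  # close to sigma^2 = 0.25 for large n
```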

## **12.3 Functional Limit Theorem**

Horel–Giesecke [190] push the above asymptotic results even one step further. Note that the asymptotic normality of Theorem 12.12 is not directly useful for variable selection, since the asymptotic result integrates over the feature space *X*. Horel– Giesecke [190] prove a functional limit theorem which we briefly review in this section.

A $q_0$-tuple $\alpha = (\alpha_1, \dots, \alpha_{q_0})^\top \in \mathbb{N}_0^{q_0}$ is called a multi-index, and we set $|\alpha| = \alpha_1 + \dots + \alpha_{q_0}$. Define the derivative operator

$$\nabla^{\alpha} = \frac{\partial^{|\alpha|}}{\partial x\_1^{\alpha\_1} \cdots \partial x\_{q\_0}^{\alpha\_{q\_0}}}.$$

Consider the compact feature space $\mathcal{X} = \{1\} \times [0,1]^{q_0}$ with $q_0 \ge 3$. Choose a distribution $\nu$ on this feature space $\mathcal{X}$ and define the $L^2$-space

$$L^2(\mathcal{X}, \nu) = \left\{ \mu : \mathcal{X} \to \mathbb{R} \text{ measurable}; \ \mathbb{E}\_{\nu}[\mu(X)^2] = \int\_{\mathcal{X}} \mu(\mathbf{x})^2 d\nu(\mathbf{x}) < \infty \right\}.$$

Next, define the Sobolev space for $k \in \mathbb{N}$

$$W^{k,2}(\mathcal{X}, \nu) = \left\{ \mu \in L^2(\mathcal{X}, \nu);\ \nabla^{\alpha}\mu \in L^2(\mathcal{X}, \nu) \text{ for all } \alpha \in \mathbb{N}_0^{q_0} \text{ with } |\alpha| \le k \right\},$$

where $\nabla^{\alpha}\mu$ is the weak derivative of $\mu$. The motivation for studying Sobolev spaces is that for sufficiently large $k$ and the existence of weak derivatives $\nabla^{\alpha}\mu \in L^2(\mathcal{X}, \nu)$, $|\alpha| \le k$, we eventually receive a classical derivative of $\mu$, see below. We define the Sobolev norm for $\mu \in W^{k,2}(\mathcal{X}, \nu)$ by

$$\|\mu\|\_{k,2} = \left(\sum\_{|\alpha| \le k} \mathbb{E}\_{\boldsymbol{\nu}} \left[ \left( \nabla^{\alpha} \mu(\boldsymbol{X}) \right)^{2} \right] \right)^{1/2}.$$

The normed Sobolev space $(W^{k,2}(\mathcal{X}, \nu), \|\cdot\|_{k,2})$ is a Hilbert space. Since we would like to consider gradient-based methods, we consider the following space

$$\mathcal{C}_B^1(\mathcal{X}, \nu) = \left\{ \mu : \mathcal{X} \to \mathbb{R} \text{ continuously differentiable};\ \|\mu\|_{\lfloor q_0/2 \rfloor + 2, 2} \le B \right\},\tag{12.8}$$

for some positive constant $B < \infty$. We will assume that the true regression function $\mu_0 \in \mathcal{C}_B^1(\mathcal{X}, \nu)$; thus, the true regression function has a bounded Sobolev norm $\|\cdot\|_{\lfloor q_0/2 \rfloor + 2, 2}$ of maximal size $B$. Assume that $\mathring{\mathcal{X}} \subset \mathbb{R}^{q_0}$ is the open interior of $\mathcal{X}$ (excluding the intercept component), and that $\nu$ is absolutely continuous w.r.t. the Lebesgue measure with a strictly positive and bounded density on $\mathcal{X}$ (excluding the intercept component). The Sobolev number of the space $W^{\lfloor q_0/2 \rfloor + 2, 2}(\mathring{\mathcal{X}}, \nu)$ is given by $m = \lfloor q_0/2 \rfloor + 2 - q_0/2 \ge 1.5 > 1$. The Sobolev embedding theorem then tells us that for any function $\mu \in W^{\lfloor q_0/2 \rfloor + 2, 2}(\mathring{\mathcal{X}}, \nu)$ there exists an $\lfloor m \rfloor$ times continuously differentiable function on $\mathring{\mathcal{X}}$ that is equal to $\mu$ a.e. Thus, the class of functions equivalent to $\mu \in W^{\lfloor q_0/2 \rfloor + 2, 2}(\mathring{\mathcal{X}}, \nu)$ has a representative in $C^1(\mathring{\mathcal{X}})$, since $\lfloor m \rfloor \ge 1$; this motivates the consideration of the space in (12.8).

In practice, the bound $B$ needs a careful consideration because the true $\mu_0$ is unknown. Therefore, $B$ should be sufficiently large so that $\mu_0$ is contained in the space $\mathcal{C}_B^1(\mathcal{X}, \nu)$ and, on the other hand, it should not be too large as this will weaken the power of the tests below.

We choose the sigmoid activation function for $\phi$ and we consider the approximate sieve estimators $(\widehat{\mu}_n)_{n\ge1}$ for given data $(Y_i, X_i)_i$ obtained by a solution to

$$\frac{1}{n}\sum_{i=1}^{n}\left(Y_i-\widehat{\mu}_n(X_i)\right)^2 \le \inf_{\mu\in\mathcal{S}_n(\phi)}\frac{1}{n}\sum_{i=1}^{n}\left(Y_i-\mu(X_i)\right)^2+o_P(1),\tag{12.9}$$

where we allow for an error term *oP (*1*)* that converges in probability to zero as *n* → ∞. In contrast to (12.6) we do not specify the error rate, here.

**Assumption 12.13** *Choose a complete probability space $(\Omega, \mathcal{A}, \mathbb{P})$ and $\mathcal{X} = \{1\} \times [0,1]^{q_0}$.*


$$\begin{split} &\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left( L_{\widehat{\mu}_n}(X_i, \varepsilon_i) - \mathbb{E}_{\nu} \left[ L_{\widehat{\mu}_n}(X_1, \varepsilon_1) \right] \right) \\ &\le\; \inf_{h \in \mathcal{C}_B^1(\mathcal{X}, \nu)} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left( L_{\mu_0 + h/r_n}(X_i, \varepsilon_i) - \mathbb{E}_{\nu} \left[ L_{\mu_0 + h/r_n}(X_1, \varepsilon_1) \right] \right) + o_P(r_n^{-1}), \end{split}$$

*for rn being the rate defined in* (12.7)*.*

The first three items of this assumption are rather similar to Assumption 12.7, which provides consistency in Theorem 12.8 and the rates of convergence in Theorem 12.10. Item (4) of Assumption 12.13 should be compared to (C3)–(C4) of Theorem 12.12, which are used to obtain asymptotic normality. $(r_n)_n$ is the rate at which the sieve estimator converges in probability to the true regression function, and this magnitude is used for the perturbation; see also (C3)–(C4) in Theorem 12.12.

**Theorem 12.14 (Asymptotics, Theorem 1 of Horel–Giesecke [190], Without Proof)** *Under Assumption 12.13 the approximate sieve estimator $(\widehat{\mu}_n)_{n\ge 1}$ of* (12.9) *converges weakly in the metric space $(\mathcal{C}^1_B(\mathcal{X},\nu),\, d_\nu)$ with $d_\nu(\mu,\mu') = \mathbb{E}_\nu\!\left[(\mu(X)-\mu'(X))^2\right]$:*

$$r_n \left(\widehat{\mu}_n - \mu_0\right) \;\Rightarrow\; \mu^\star \qquad \text{as } n \to \infty,$$

*where $\mu^\star$ is the* arg max *of the Gaussian process $\{G_\mu;\, \mu \in \mathcal{C}^1_B(\mathcal{X},\nu)\}$ with mean zero and covariance function $\mathrm{Cov}(G_\mu, G_{\mu'}) = 4\sigma^2\,\mathbb{E}_\nu[\mu(X)\mu'(X)]$.*

*Remarks 12.15* We highlight the differences between Theorems 12.12 and 12.14.


## **12.4 Hypothesis Testing**

Theorem 12.14 can be used to provide a significance test for feature component selection, similarly to the LRT and the Wald test presented in Sect. 5.3.2 on GLMs. We define gradient-based test statistics, for $1 \le j \le q_0$, w.r.t. the approximate sieve estimator $\widehat{\mu}_n \in \mathcal{S}_n(\phi)$ given in (12.9),

$$\Lambda_j^{(n)} = \int_{\mathcal{X}} \left( \frac{\partial \widehat{\mu}_n(\boldsymbol{x})}{\partial x_j} \right)^2 d\nu(\boldsymbol{x}) \qquad \text{and} \qquad \widehat{\Lambda}_j^{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\partial \widehat{\mu}_n(\boldsymbol{X}_i)}{\partial x_j} \right)^2.$$

The test statistic $\Lambda_j^{(n)}$ integrates the squared partial derivative of the sieve estimator $\widehat{\mu}_n$ w.r.t. the distribution $\nu$, whereas $\widehat{\Lambda}_j^{(n)}$ can be considered as its empirical counterpart if $X \sim \nu$. Note that both test statistics depend on the data $(Y_i, X_i)_{1\le i\le n}$ determining the sieve estimator $\widehat{\mu}_n$, see (12.9). These test statistics are used to test the following null hypothesis $H_0$ against the alternative hypothesis $H_1$ for the true regression function $\mu_0 \in \mathcal{C}^1_B(\mathcal{X},\nu)$

$$H\_0: \lambda\_j = \mathbb{E}\_{\nu} \left[ \left( \frac{\partial \mu\_0(X)}{\partial x\_j} \right)^2 \right] = 0 \qquad \text{against} \qquad H\_1: \lambda\_j \neq 0. \tag{12.10}$$

We emphasize that the expression $\lambda_j$ in (12.10) is a deterministic number; for this reason we use the expected value notation $\mathbb{E}_\nu[\cdot]$. This is in contrast to $\Lambda_j^{(n)}$, which is only a conditional expectation, conditionally given the data $(Y_i, X_i)_{1\le i\le n}$.
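On a fitted regression function, the empirical statistic $\widehat{\Lambda}_j^{(n)}$ can be approximated by central finite differences when the gradient is not available analytically. A minimal sketch (the regression function `mu` and the step size `h` are illustrative assumptions):

```python
import numpy as np

def lambda_hat(mu_hat, X, j, h=1e-5):
    """Empirical test statistic: average squared partial derivative of
    mu_hat w.r.t. component j, via central finite differences."""
    Xp, Xm = X.copy(), X.copy()
    Xp[:, j] += h
    Xm[:, j] -= h
    dmu = (mu_hat(Xp) - mu_hat(Xm)) / (2 * h)
    return np.mean(dmu ** 2)

# illustration: mu(x) = x_1^2 ignores x_2, so the statistic for j=2 vanishes
rng = np.random.default_rng(0)
X = rng.uniform(size=(10_000, 2))
mu = lambda x: x[:, 0] ** 2
lam1 = lambda_hat(mu, X, 0)   # approximates E[(2 X_1)^2] = 4/3 for X_1 ~ U(0,1)
lam2 = lambda_hat(mu, X, 1)   # vanishes, consistent with H_0 for component 2
```

A small value of the statistic for a component supports the null hypothesis (12.10) that the regression function does not depend on that component.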

**Proposition 12.16 (Theorem 2 and Proposition 3 of Horel–Giesecke [190], Without Proof)** *Under Assumption 12.13 and under the null hypothesis H*<sup>0</sup> *we have for n* → ∞

$$r_n^2 \Lambda_j^{(n)},\; r_n^2 \widehat{\Lambda}_j^{(n)} \;\Rightarrow\; \Psi_j \stackrel{\text{def.}}{=} \int_{\mathcal{X}} \left(\frac{\partial \mu^\star(\boldsymbol{x})}{\partial x_j}\right)^2 d\nu(\boldsymbol{x}).\tag{12.11}$$

In order to use this proposition we need to be able to calculate the limiting distribution characterized by the random variable $\Psi_j$. The maximal argument $\mu^\star$ of the Gaussian process $\{G_\mu;\, \mu \in \mathcal{C}^1_B(\mathcal{X},\nu)\}$ is given by a random function such that for all $\omega \in \Omega$, $\mu^\star_\omega(\cdot)$ fulfills

$$G_{\mu^\star_\omega(\cdot)}(\omega) \;\ge\; G_{\mu}(\omega) \qquad \text{for all } \mu \in \mathcal{C}^1_B(\mathcal{X}, \nu).$$

A discretization and simulation approach can be explored to approximate this maximal argument $\mu^\star$ for different $\omega \in \Omega$, see Section 5.7 in Horel–Giesecke [190].


$$\widehat{\Sigma} = \left( \frac{1}{n} \sum_{i=1}^{n} f_k(X_i) f_l(X_i) \right)_{1 \le k, l \le K}.$$

These random variables $G^{(1)}, \ldots, G^{(T)}$ play the role of discretized random samples of the Gaussian process $\{G_\mu;\, \mu \in \mathcal{C}^1_B(\mathcal{X},\nu)\}$.

3. The empirical arg max of the sample $G^{(t)}$, $1 \le t \le T$, is obtained by

$$\widehat{\mu}_t^\star = \underset{f_k:\, 1 \le k \le K}{\arg\max}\; G_{f_k}^{(t)},$$

where $G^{(t)}_{f_k}$ is the $k$-th component of $G^{(t)}$.
4. The empirical distribution of the following sample $\widehat{\Psi}_j^{(t)}$, $1 \le t \le T$, gives us an approximation to the limiting distribution in Proposition 12.16

$$\widehat{\Psi}_j^{(t)} = \frac{1}{n} \sum_{i=1}^n \left( \frac{\partial \widehat{\mu}_t^\star(X_i)}{\partial x_j} \right)^2,$$

i.e., under the null hypothesis $H_0$ we approximate the right-hand side of (12.11) by the empirical distribution of $(\widehat{\Psi}_j^{(t)})_{1\le t\le T}$.
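The simulation steps above can be sketched on a toy setting; the sine basis functions $f_k$, the noise level $\sigma$ and the sample sizes are illustrative assumptions (the covariate is one-dimensional, so $j = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, K, sigma = 2000, 500, 5, 1.0
X = rng.uniform(size=n)                      # covariate sample from nu

# basis functions f_1,...,f_K discretizing the function class, and derivatives
fs  = [lambda x, k=k: np.sin((k + 1) * np.pi * x) for k in range(K)]
dfs = [lambda x, k=k: (k + 1) * np.pi * np.cos((k + 1) * np.pi * x) for k in range(K)]

F = np.column_stack([f(X) for f in fs])      # (n, K) basis values
Sigma_hat = F.T @ F / n                      # empirical covariance matrix
cov = 4 * sigma**2 * Sigma_hat               # covariance of the Gaussian process

# samples G^(1),...,G^(T) of the discretized Gaussian process
G = rng.multivariate_normal(np.zeros(K), cov, size=T)
kmax = G.argmax(axis=1)                      # empirical arg max per draw

# Psi_hat^(t): mean squared derivative of the selected basis function
psi_per_basis = np.column_stack([df(X) ** 2 for df in dfs]).mean(axis=0)
Psi_sample = psi_per_basis[kmax]             # approximate sample of the limit
```

The empirical quantiles of `Psi_sample` then serve as approximate critical values for the test statistics in Proposition 12.16.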

We close this section with some remarks.

*Remarks 12.17*


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 13 Appendix B: Data and Examples**

This appendix presents and describes the data sets used.

## **13.1 French Motor Third Party Liability Data**

We consider a French motor third party liability (MTPL) claims data set. This data set is available through the R library CASdatasets<sup>1</sup> being hosted by Dutang– Charpentier [113]. The specific data sets chosen from CASdatasets are called FreMTPL2freq and FreMTPL2sev, the former contains the insurance policy and claim frequency information and the latter the corresponding claim severity information.<sup>2</sup>

Before we can work with this data set we perform data cleaning. It has been pointed out by Loser [259] that the claim counts on the insurance policies with policy IDs ≤ 24500 in FreMTPL2freq do not seem to be correct, because these claims do not have claim severity counterparts in FreMTPL2sev. For this reason we work with the claim counts extracted from the latter file. In Listing 13.1 we give the code used for data cleaning.<sup>3</sup> In this code we merge FreMTPL2freq with the severities aggregated per insurance policy, and the corresponding claim counts are obtained from FreMTPL2sev; this is done on lines 2–11 of Listing 13.1. A

<sup>1</sup> CASdatasets website: http://cas.uqam.ca/.

<sup>2</sup> We use CASdatasets version 1.0–8 which has been packaged on 2018-05-20. This version uses the labels R11, ..., R94 for the 22 French regions. In later versions of CASdatasets these labels have been replaced by the region names; in this transformation the labels R31 (Nord-Pas-de-Calais) and R41 (Lorraine) have been merged to one region called Nord-Pas-de-Calais. We believe that this is an error and therefore prefer to work with an older version of CASdatasets. This older version can be downloaded in R with library(OpenML), library(farff), freMTPL2freq <- getOMLDataSet(data.id = 41214)$data

<sup>3</sup> The code in Listing 13.1 is a modified version of the R code provided by Loser [259].

further inspection of the data indicates that policies with more than 5 claims may be data errors, because they all seem to belong to the same driver (and they have very short exposures).<sup>4</sup> For this reason we drop these records on line 12. On line 13 we censor exposures at one accounting year (since these policies are active within one calendar year). Finally, on lines 15–16 we re-level the VehBrands.<sup>5</sup> All subsequent analysis is based on this cleaned data set.

**Listing 13.1** Data cleaning applied to the French MTPL data set

```
1 #
2 data(freMTPL2freq)
3 dat <- freMTPL2freq[, -2]
4 dat$VehGas <- factor(dat$VehGas)
5 data(freMTPL2sev)
6 sev <- freMTPL2sev
7 sev$ClaimNb <- 1
8 dat0 <- aggregate(sev, by=list(IDpol=sev$IDpol), FUN = sum)[c(1,3:4)]
9 names(dat0)[2] <- "ClaimTotal"
10 dat <- merge(x=dat, y=dat0, by="IDpol", all.x=TRUE)
11 dat[is.na(dat)] <- 0
12 dat <- dat[which(dat$ClaimNb <=5),]
13 dat$Exposure <- pmin(dat$Exposure, 1)
14 sev <- sev[which(sev$IDpol %in% dat$IDpol), c(1,2)]
15 dat$VehBrand <- factor(dat$VehBrand, levels=c("B1","B2","B3","B4","B5","B6",
16 "B10","B11","B12","B13","B14"))
```
**Listing 13.2** Excerpt of the French MTPL data set


Listing 13.2 gives an excerpt of the cleaned French MTPL data set; lines 2–14 give the insurance policy and claim counts information, and lines 17–18

<sup>4</sup> Short exposure policies may also belong to a commercial car rental company.

<sup>5</sup> The data set FreMTPLfreq of CASdatasets is a subset of FreMTPL2freq with slightly changed feature components, for instance, the former data set contains car brand names in a more aggregated version than the latter, see Table 13.2, below.

display the individual claim amounts. We have 9 feature components on lines 4–12 (1 component is binary, 3 components are categorical, and 5 components are continuous), an exposure variable on line 3, and claim information on lines 13–14 and 18. In total we have 26'383 claims on 678'007 insurance policies.

We start by giving a descriptive analysis of the data; this closely follows Noll et al. [287]. We have the following insurance policy information:


We start by describing the Exposure. The Exposure measures the duration of an insurance policy in yearly units; sometimes it is also called *years-at-risk*. The shortest exposure in our data set is 0.0027 which corresponds to 1 day, and the longest exposure is 1 which corresponds to 1 year. Figure 13.2 (lhs, middle) shows a histogram and a boxplot of these exposures. In view of the histogram we conclude that roughly 1/4 of all policies have a full exposure of 1 calendar year, and all other policies are only partly exposed during the calendar year. From a practical insurance point of view this high ratio of partly exposed policies seems rather


**Fig. 13.2** (lhs) Histogram of Exposure, (middle) boxplot of Exposure, (rhs) number of observed claims ClaimNb of the French MTPL data


unusual. A further inspection of the data indicates that policy renewals during the year account for two separate records in the data set. Of course, such split policies should be merged to one yearly policy. Unfortunately, we do not have the necessary information to perform this merger; therefore, we need to work with the data as it is. In Table 13.1 and Fig. 13.2 (rhs) we split the portfolio w.r.t. the number of claims. On 653'069 insurance policies (amounting to a total exposure of 341'090 years-at-risk) we do not have any claim, and on the remaining 24'938 policies (17'269 years-at-risk) we have at least one claim. The overall portfolio claim frequency (w.r.t. Exposure) is $\lambda = 7.35\%$.

We study the split of this overall frequency $\lambda = 7.35\%$ across the different feature levels. This empirical analysis is crucial for the model choice in regression modeling.<sup>6</sup> For the empirical analysis we provide 3 different types of graphs for each feature component (where applicable); these are given in Figs. 13.3, 13.4, 13.5, 13.6, 13.7, 13.8, 13.9, 13.10, and 13.11. The first graph (lhs) gives the split of the total exposure across the different feature levels, the second graph (middle) gives the average feature value in each French region (green meaning low and red meaning high),<sup>7</sup> and the third graph (rhs) gives the observed average frequency per feature level. This observed frequency is obtained by dividing the total number of claims by the total exposure per feature level. The frequencies are complemented by confidence bounds of two estimated standard deviations (shaded area). The standard deviations are estimated under

<sup>6</sup> The empirical analysis in these notes differs from Noll et al. [287] because data cleaning has been done differently here, we refer to Listing 13.1.

<sup>7</sup> We acknowledge the use of UNESCO (1987) database through UNEP/GRID-Geneva for the French map.

**Fig. 13.3** (lhs) Histogram of exposures per Area code, (middle) average Area code per Region, where we map $(A,\ldots,F) \mapsto (1,\ldots,6)$, (rhs) observed frequency per Area code

**Fig. 13.4** (lhs) Histogram of exposures per VehPower, (middle) average VehPower per Region, (rhs) observed frequency per VehPower

**Fig. 13.5** (lhs) Histogram of exposures per VehAge (censored at 20), (middle) average VehAge per Region, (rhs) observed frequency per VehAge

**Fig. 13.6** (lhs) Histogram of exposures per DrivAge (censored at 90), (middle) average DrivAge per Region, (rhs) observed frequency per DrivAge (*y*-scale is different compared to the other frequency plots)

**Fig. 13.7** (lhs) Histogram of exposures per BonusMalus level (censored at 150), (middle) average BonusMalus level per Region, (rhs) observed frequency per BonusMalus level (*y*scale is different compared to the other frequency plots)

a Poisson assumption, thus, they are obtained by $\pm 2\sqrt{\widehat{\lambda}_k/\text{Exposure}_k}$, where $\widehat{\lambda}_k$ is the observed frequency and $\text{Exposure}_k$ is the total exposure for a given feature level $k$. We note that in all frequency plots the $y$-axis ranges from 0% to 20%, except in the BonusMalus plot where the maximum is set to 60%, and the DrivAge plot where the maximum is set to 40%. From these plots we conclude that some levels have only a small underlying Exposure; BonusMalus leads to the highest variability in frequencies followed by DrivAge; and there is quite some heterogeneity.
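As an illustration, the observed frequencies and two-standard-deviation bounds per feature level can be computed as follows; the claim counts and exposures below are made-up numbers, not the French MTPL data:

```python
import numpy as np

# toy aggregated data per feature level k: total claims and total exposure
claims   = np.array([120.,  95.,  40.,  10.])
exposure = np.array([1500., 900., 450., 80.])

lam_hat = claims / exposure                 # observed frequency per level
# Poisson: Var(N_k) = lambda_k * Exposure_k, so the estimated standard
# deviation of lam_hat_k is sqrt(lam_hat_k / Exposure_k)
half_width = 2 * np.sqrt(lam_hat / exposure)
lower, upper = lam_hat - half_width, lam_hat + half_width
```

Levels with small exposure (such as the last one above) receive wide bounds, which is exactly the shaded-area behavior visible in the frequency plots.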

Table 13.2 gives the assignment of the different VehBrand levels to car brands. This list has been compiled from the two data sets FreMTPLfreq and FreMTPL2freq contained in the R package CASdatasets [113], see Footnote 5.

Next, we analyze collinearity between the feature components. For this we calculate Pearson's correlation and Spearman's Rho for the continuous feature components, see Table 13.3. In general, these correlations are low, except for DrivAge vs. BonusMalus. Of course, the latter is very sensible because a BonusMalus

**Fig. 13.8** (lhs) Histogram of exposures per VehBrand, (rhs) observed frequency per VehBrand; for VehBrand assignment we refer to Table 13.2

**Fig. 13.9** (lhs) Histogram of exposures per VehGas, (middle) average VehGas per Region (diesel is green and regular red), (rhs) observed frequency per VehGas

**Fig. 13.10** (lhs) Histogram of exposures per population Density (on log-scale), (middle) average population Density per Region, (rhs) observed frequency per population Density; in general, we always consider Density on the log-scale

**Fig. 13.11** (lhs) Histogram of exposures Exposure, and (middle, rhs) observed claim frequencies per Region in France (prior to 2016)


**Table 13.3** Correlations in feature components: top-right shows Pearson's correlation; bottomleft shows Spearman's Rho; Density is considered on the log-scale; significant correlations are boldface


level below 100 requires a certain number of driving years without claims. We give the corresponding boxplot in Fig. 13.12 (lhs), which confirms this negative correlation. Figure 13.12 (rhs) gives the boxplot of log-Density vs. Area code. From this plot we conclude that the Area code has likely been set w.r.t. the log-Density. For our regression models this means that we can drop the Area code information and work only with Density. Nevertheless, we will use the Area code to show what happens in case of collinear feature components, i.e., if we replace $(A,\ldots,F) \mapsto (1,\ldots,6)$.

Figure 13.13 illustrates each continuous feature component w.r.t. the different VehBrands. Vehicle brands B10 and B11 (Mercedes, Chrysler and BMW) have more VehPower than other cars, with B10 more likely being a diesel car, and vehicle brand B12 (Japanese and Korean cars) has comparatively new cars in more densely populated French regions.

**Fig. 13.12** Boxplots (lhs) BonusMalus vs. DrivAge, (rhs) log-Density vs. Area code; these plots are inspired by Fig. 2 in Lorentzen–Mayer [258]

More formally, the strength of dependence between categorical variables can be measured by Cramér's $V$. Cramér's $V$ is based on the $\chi^2$-test of independence on contingency tables. We briefly explain this. Assume we have two-dimensional categorical features $\boldsymbol{x} = (x_1, x_2) \in \mathcal{X}$ having $m_1$ and $m_2$ levels, respectively. Let $p_{\boldsymbol{x}}$ describe the probability on $\mathcal{X}$ that a randomly chosen insurance policy takes feature $\boldsymbol{x}$, and let $p_{x_1}$ and $p_{x_2}$ be the marginal distributions of $p_{\boldsymbol{x}}$. If the two components of $\boldsymbol{x}$ are independent with these two marginals, then we have the special (independence) distribution

$$\pi_{\boldsymbol{x}} = p_{x_1}\, p_{x_2} \qquad \text{for all } \boldsymbol{x} = (x_1, x_2) \in \mathcal{X}.$$

The $\chi^2$-test for independence now compares $p_{\boldsymbol{x}}$ with $\pi_{\boldsymbol{x}}$. Assume we have $n$ observations. Denote by $n_{\boldsymbol{x}} = n_{x_1,x_2}$ the number of instances that have feature $\boldsymbol{x} = (x_1, x_2)$, and let $n_{x_1,\cdot}$ and $n_{\cdot,x_2}$ be the corresponding marginal counts. The $\chi^2$-test statistic is given by

$$\chi^2 = \sum_{\boldsymbol{x} = (x_1, x_2) \in \mathcal{X}} \frac{\left(n_{\boldsymbol{x}} - \frac{n_{x_1,\cdot}\, n_{\cdot,x_2}}{n}\right)^2}{\frac{n_{x_1,\cdot}\, n_{\cdot,x_2}}{n}}.$$

Under the null hypothesis of independence between the components of $\boldsymbol{x}$, the test statistic $\chi^2$ converges in distribution to a $\chi^2$-distribution with $(m_1 m_2 - 1)$ degrees of freedom if we let the number of independently drawn instances go to infinity. Seven different proofs of this statement are given in Benhamou–Melot [30].

**Fig. 13.13** Distribution of the variables VehPower, VehAge, DrivAge, BonusMalus, log-Density, VehGas for each car brand VehBrand, individually


**Table 13.4** Cramér's *V* for the categorical feature components vs. the categorized continuous components

**Fig. 13.14** VehBrands in the different French Regions

We scale the test statistic to the interval $[0, 1]$ by dividing it by the comonotonic (maximally dependent) case and by the sample size $n$. This motivates Cramér's $V$

$$V = \sqrt{\frac{\chi^2/n}{\min\{m\_1 - 1, m\_2 - 1\}}} \in [0, 1].$$

Section 7.2.3 of Cohen [78] gives a rule of thumb for small, medium and large dependence. Cohen [78] calls the association between $x_1$ and $x_2$ small if $V\sqrt{\min\{m_1-1,\, m_2-1\}}$ is less than 0.1, of medium strength if this value is around 0.3, and a large effect if it is around 0.5. Our results are presented in Table 13.4. Clearly, there is some association between VehBrand and both VehPower and VehAge; this can also be seen from Fig. 13.13. For the remaining variables the dependence is somewhat weaker. Not surprisingly, Cramér's $V$ shows the largest value between Region and log-Density.
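The computation of $\chi^2$ and Cramér's $V$ from a contingency table can be sketched as follows; the two tables are toy examples chosen to hit the extreme cases:

```python
import numpy as np

def cramers_v(table):
    """Cramer's V from an (m1, m2) contingency table of counts n_x."""
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)   # marginal counts n_{x1,.}
    col = table.sum(axis=0, keepdims=True)   # marginal counts n_{.,x2}
    expected = row @ col / n                 # counts under independence
    chi2 = ((table - expected) ** 2 / expected).sum()
    m1, m2 = table.shape
    return np.sqrt(chi2 / n / min(m1 - 1, m2 - 1))

# perfectly dependent 3x3 table (diagonal): V = 1
V_max = cramers_v(np.diag([10., 20., 30.]))
# exactly independent 2x2 table (outer product of marginals): V = 0
V_ind = cramers_v(np.outer([10., 20.], [5., 15.]))
```

The diagonal table attains the comonotonic bound $V = 1$, while the outer-product table reproduces its marginals exactly and gives $V = 0$.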

In Fig. 13.14 we show the VehBrands in the different French Regions; Cramér's $V$ is 0.13 for these two categorical variables, and multiplying by $\sqrt{11-1}$ gives a value bigger than 0.4, which is a considerable association according to Cohen [78]. We note that in some regions the French car brands B1 and B2 are very dominant, whereas on the island of Corsica (R94) 80% of the cars in our portfolio are Japanese

**Fig. 13.15** Empirical density and log-log plots of the observed claim amounts

or Korean cars B12. Our portfolio has its biggest exposure in Region R24, see Fig. 13.11, in this region French cars are predominant.

Next, we study the claim sizes of this French MTPL example. Figure 13.15 shows the empirical density plot and the log-log plot. These two plots already illustrate the main difficulty we often face in claim size modeling. From the empirical density plot we observe that there are many payments of fixed size (red vertical lines), which do not match any absolutely continuous distribution function assumption. The log-log plot shows heavy-tailedness because we observe asymptotically a straight line with negative slope on the log-log scale; this indicates regularly varying tails and, thus, the EDF is not a suitable model on the original observation scale.
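The log-log diagnostic can be reproduced on simulated data: for regularly varying claim sizes, the log empirical survival function is asymptotically linear in the log claim size, with slope equal to minus the tail index. A minimal sketch (the Pareto tail index 1.5 and the 5% tail cutoff are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.5                                  # tail index of the simulated claims
# numpy's pareto() draws Lomax variables; adding 1 gives a classical Pareto
# with survival function z^(-alpha) for z >= 1
Z = rng.pareto(alpha, size=100_000) + 1.0

Z_sorted = np.sort(Z)
n = len(Z)
surv = 1.0 - np.arange(1, n + 1) / (n + 1)   # empirical survival at order statistics

# fit a straight line to log-survival vs log-size in the tail (top 5% of claims)
tail = Z_sorted > np.quantile(Z_sorted, 0.95)
slope, intercept = np.polyfit(np.log(Z_sorted[tail]), np.log(surv[tail]), 1)
```

A fitted slope close to $-\alpha$ confirms the regularly varying tail; on real claim sizes the analogous plot produces the straight line visible in Fig. 13.15.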

Figure 13.16 gives the boxplots of the claim sizes per feature level (we omit the claims outside the whiskers because heavy-tailedness would distort the picture). The empirical mean in orange is much bigger than the median in red, which also reflects the heavy-tailedness. From these plots we conclude that the claim sizes seem less sensitive to feature values, which may question the use of a regression model for claim sizes.

Figure 13.17 shows the density plots for different feature levels. Interestingly, it seems that the features determine the sizes of the modes, for instance, if we focus on Area, Fig. 13.17 (top-left), we see that the area codes mainly influence the sizes of the modes. This may be interpreted by modes corresponding to different claim types which occur at different frequencies among the area codes.

## **13.2 Swedish Motorcycle Data**

Our second example considers the Swedish motorcycle data which was originally used in Ohlsson–Johansson [290]. It is available through the R library

**Fig. 13.16** Boxplots of claim sizes per feature level: these plots omit the claims outside the whiskers; red color shows the median and orange color the empirical mean

CASdatasets [113], and it is called swmotorcycle. Listing 13.3 shows the data cleaning that we have used, and Listing 13.4 gives an excerpt of the cleaned data.

We briefly describe the data. The data considers comprehensive insurance for motorcycles. This covers loss or damage of motorcycles other than collision, e.g., caused by theft, fire or vandalism. The data contains aggregated claims on feature levels for the years 1994–1998. We have claims on 656 out of the 62'036 different feature combinations, thus, only slightly more than 1% of all feature combinations suffer a claim in the considered period.

**Fig. 13.17** Empirical claim size densities split w.r.t. the different levels of the feature components

We start by describing the available variables on lines 2–10 of Listing 13.4:


```
1 library(CASdatasets)
2 data(swmotorcycle)
3 mcdata <- swmotorcycle
4 mcdata$Gender <- as.factor(mcdata$Gender)
5 mcdata$Area <- as.factor(mcdata$Area)
6 mcdata$Area <- factor(mcdata$Area,levels(mcdata$Area)[c(1,7,3,6,5,4,2)])
7 mcdata$Area <- c("Zone 1","Zone 2","Zone 3","Zone 4","Zone 5",
8 "Zone 6","Zone 7")[as.integer(mcdata$Area)]
9 mcdata$Area <- as.factor(mcdata$Area)
10 mcdata$RiskClass <- as.factor(mcdata$RiskClass)
11 mcdata$RiskClass <- factor(mcdata$RiskClass,
12 levels(mcdata$RiskClass)[c(1,6,7,3,4,5,2)])
13 mcdata$RiskClass <- as.integer(mcdata$RiskClass)
14 mcdata$BonusClass <- as.integer(as.factor(mcdata$BonusClass))
15 #
16 mcdata <- mcdata[which(mcdata$OwnerAge>=18),] # only minimal age 18
17 mcdata$OwnerAge <- pmin(70, mcdata$OwnerAge) # set maximal age 70
18 mcdata$VehAge <- pmin(30, mcdata$VehAge) # set maximal motorcycle age 30
19 mcdata <- mcdata[which(mcdata$Exposure>0),] # only positive exposures
```
**Listing 13.4** Excerpt of the Swedish motorcycle data set



We start with a descriptive and exploratory analysis of the Swedish motorcycle data of Listing 13.4. We have $n = 62'036$ different feature combinations with positive Exposure. This Exposure is aggregated over individual policies with a fixed feature combination. We denote by $N_i$ the number of claims on feature $i$, this corresponds to ClaimNb, and the total claim amount ClaimAmount is denoted by $S_i = \sum_{j=1}^{N_i} Z_{i,j}$, where $Z_{i,j}$ are the individual claim sizes on feature $i$ (in case of claims). The empirical claim frequency is $\bar{\lambda} = \sum_{i=1}^n N_i / \sum_{i=1}^n v_i = 1.05\%$, and the average claim size is $\bar{\mu} = \sum_{i=1}^n S_i / \sum_{i=1}^n N_i = 24'641$ Swedish crowns SEK.

**Fig. 13.18** (lhs) Boxplot of Exposure on the log-scale (the horizontal line corresponds to 1 accounting year), (rhs) histogram of the number of observed claims ClaimNb per feature of the Swedish motorcycle data

Figure 13.18 shows the boxplot over all Exposures and the claim counts on all insurance policies. We note that insurance claims are rare events for this product, because the empirical claim frequency is only $\bar{\lambda} = 1.05\%$.

Figures 13.19 and 13.20 give the marginal total exposures (split by gender), the marginal claim frequencies and the marginal average claim amounts for the covariate components OwnerAge, Area, RiskClass, VehAge and BonusClass. We observe that we have a very imbalanced portfolio between genders; only 11% of the total exposure comes from females. The empirical claim frequency of females is 0.86% and that of males is 1.08%. We note that the female claim frequency comes from (only) 61 claims (based on a female exposure of 7'094 accounting years, versus 57'679 for males). Therefore, it is difficult to analyze females separately, and all marginal claim frequencies and claim sizes in Figs. 13.19 and 13.20 (middle and rhs) are analyzed jointly for both genders. If we run a simple Poisson GLM that only involves Gender as feature component, it turns out that the female frequency is 20% lower than the male frequency (remember we have the balance property on each dummy variable, see Example 5.12), but this variable should not be kept in the model on a 5% significance level. The same holds for claim amounts.
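The gender comparison can be checked with a two-rate Poisson (Wald) test; the female figures are the ones quoted above, while the male claim count is backed out from the quoted male frequency of 1.08% and is therefore an approximation, not taken from the data:

```python
import numpy as np

# female figures quoted in the text; male claim count is an assumption
# backed out from the quoted male frequency of 1.08%
claims_f, expo_f = 61, 7094.0
expo_m = 57679.0
claims_m = round(0.0108 * expo_m)            # roughly 623 male claims

lam_f, lam_m = claims_f / expo_f, claims_m / expo_m
beta = np.log(lam_f / lam_m)                 # log rate ratio (Poisson GLM coefficient)
se = np.sqrt(1 / claims_f + 1 / claims_m)    # Wald standard error of beta
z = beta / se                                # significant at 5% only if |z| > 1.96
```

The point estimate suggests a roughly 20% lower female frequency, but with only 61 female claims the Wald statistic stays below 1.96 in absolute value, matching the conclusion that Gender should not be kept at the 5% significance level.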

The empirical marginal frequencies in Figs. 13.19 and 13.20 (middle) are complemented with confidence bands of ±2 standard deviations. From the plots we conclude that we should keep the explanatory variables OwnerAge, Area, RiskClass and VehAge, but the variable BonusClass does not seem to have any predictive power. At first sight, this seems surprising because the bonus class encodes the past claims history. The reason that the bonus class is not needed for our claims is that we consider comprehensive insurance for motorcycles covering loss or damage other than collision (for instance, caused by theft, fire or vandalism), whereas the bonus class encodes collision claims.

**Fig. 13.19** (Top, middle and bottom rows) OwnerAge, Area, RiskClass: (lhs) histogram of exposures (split by gender), (middle) observed claim frequency, (rhs) boxplot of observed average claim amounts *μ*¯*<sup>i</sup>* = *Si/Ni* of features with *Ni >* 0 (on log-scale)

For a regression analysis, Zones 5 to 7 should be merged because of small exposures and similar behavior; the same applies to RiskClass 6 and 7, and to VehAge above 20.

Figure 13.21 shows the correlations between the features: (top) correlations between continuous features, (bottom) dependence between continuous features and the categorical Area feature. We observe some dependence, for instance, in Zone 1 (the three largest Swedish cities) the motorcycles are lighter (RiskClass) and newer. Older owners drive lighter and older motorcycles, and older motorcycles are lighter.

Figure 13.22 gives the empirical density, the empirical distribution and the log-log plot of the average claim amounts $\bar{\mu}_i = S_i/N_i$. From the log-log plot we conclude that the average claim amounts are not heavy-tailed for this motorcycle insurance product.

**Fig. 13.20** (Top and bottom rows) VehAge, BonusClass: (lhs) histogram of exposures (split by gender), (middle) observed claim frequency, (rhs) boxplot of observed average claim amounts *μ*¯*<sup>i</sup>* = *Si/Ni* of features with *Ni >* 0 (on log-scale)

## **13.3 Wisconsin Local Government Property Insurance Fund**

The third example considers property insurance claims of the Wisconsin Local Government Property Insurance Fund (LGPIF). This data<sup>8</sup> has been made available through the book project of Frees [135],<sup>9</sup> and it is also used in Lee et al. [236]. The Wisconsin LGPIF is an insurance pool that is managed by the Wisconsin Office of the Insurance Commissioner. This fund provides insurance protection to local governmental institutions such as counties, schools, libraries, airports, etc. It insures property claims for buildings and motor vehicles, and it excludes certain natural and man-made perils like floods, earthquakes or nuclear accidents. We give a description of the data (we have applied some data cleaning to the original data).

The special feature of this data is that we have a short claim description on line 11 of Listing 13.5. This description will allow us to better understand the claim type beyond just knowing the hazard type that has been affected.

Figure 13.23 gives the empirical density (upper-truncated at 50'000) and the log-log plot of the observed LGPIF claim amounts. Most claims are below 10'000; however, the log-log plot clearly shows that the data is heavy-tailed, the largest claim being

<sup>8</sup> https://github.com/OpenActTexts/Loss-Data-Analytics/tree/master/Data.

<sup>9</sup> https://ewfrees.github.io/Loss-Data-Analytics/.

**Fig. 13.21** (Top) Correlations: top-right shows Pearson's correlation; bottom-left shows Spearman's Rho; (bottom) boxplots of OwnerAge, RiskClass, VehAge versus Area (where Zones 5–7 have been merged)

**Fig. 13.22** (lhs) Empirical density (middle) empirical distribution and (rhs) log-log plot of average claim amounts *μ*¯*<sup>i</sup>* = *Si/Ni* of features with *Ni >* 0

12'922'218 and 13 claims being above 1 million. These claims are further described by the features given in Listing 13.5.

In our example we will not focus on modeling the claim sizes, but rather aim at predicting the hazard types from the claim descriptions. There are 9 different hazard types: Fire, Lightning, Hail, Wind, WaterW, WaterNW, Vehicle, Vandalism and Misc. The last label contains all claims that cannot be allocated to one of the previous hazard types; WaterW refers to weather-related water claims and WaterNW to the non-weather-related ones. If we only focus on this latter problem, we have more data available, as there is a training data set and a validation data

**Fig. 13.23** (lhs) Empirical density (upper-truncated at 50'000), (rhs) log-log plot of the observed LGPIF claim amounts



set with hazard types and claim descriptions.<sup>10</sup> In total, we have 6'031 such claim descriptions, see Listing 13.6, which are studied in our text recognition Chap. 10.

**Listing 13.6** Excerpt of the Wisconsin LGPIF claim descriptions

```
1 'data.frame': 6031 obs. of 2 variables:
2 Hazard : Factor w/ 9 levels "Fire","Hail",..: 1 3 3 5 5 9 3 6 ...
3 Description: chr "fire damage at Town Hall"
4 "lightning damage at water tower" ...
```
<sup>10</sup> https://github.com/OpenActTexts/Loss-Data-Analytics/tree/master/Data.
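The first pre-processing step for such claim descriptions is a bag-of-words representation, as discussed in Chap. 10. The following Python sketch illustrates the idea on two toy records mimicking Listing 13.6 (the records, the `tokenize` function and the variable names are ours, not the book's code or the actual LGPIF data):

```python
from collections import Counter

# two toy (hazard type, claim description) records mimicking Listing 13.6
records = [
    ("Fire", "fire damage at Town Hall"),
    ("Lightning", "lightning damage at water tower"),
]

def tokenize(description):
    """Lower-case the claim description and split it into word tokens."""
    return description.lower().split()

# vocabulary with word counts over all claim descriptions
vocab = Counter(tok for _, desc in records for tok in tokenize(desc))

# bag-of-words representation: one count vector per claim description
index = {word: i for i, word in enumerate(sorted(vocab))}

def bag_of_words(description):
    vec = [0] * len(index)
    for tok in tokenize(description):
        vec[index[tok]] += 1
    return vec
```

Such count vectors (or refinements like word embeddings) then serve as feature input for the hazard-type classifier.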

## **13.4 Swiss Accident Insurance Data**

Our next example considers Swiss accident insurance data.<sup>11</sup> This data set is not publicly available. Swiss accident insurance is compulsory for employees, i.e., by law each employer has to sign an insurance contract to protect its employees against accidents. This insurance cover includes both work and leisure accidents, and it covers medical expenses and daily allowance. Listing 13.7 gives an excerpt of the data. Line BU indicates whether we have a workplace or a leisure accident, line 10 gives the medical expenses, and line 12 shows the allowance expenses. In the subsequent analysis we only consider medical expenses.

**Listing 13.7** Excerpt of the Swiss accident insurance data set


Sector indicates the labor sector of the insured company, AccQuart gives the accident quarter (leisure claims have a seasonal component), RepDel gives the reporting delay in yearly units, Age is the age of the injured (in 5-year buckets), and InjType and InjPart denote the injury type and the injured body part.

Figure 13.24 gives the empirical density (upper-truncated at 10'000) and the log-log plot of the observed Swiss accident insurance claim amounts. Most claims are below 5'000; however, the log-log plot shows some heavy-tailedness, the largest claim exceeding 1'300'000 CHF.

Figure 13.25 shows the average claim amounts split w.r.t. the different feature components (top) Sector, AccQuart, RepDel, (bottom) Age, InjType, InjPart, and, moreover, split by work and leisure accidents (cyan and gray in the colored version). Typically, leisure accidents are more numerous and, on average, more expensive than accidents at the workplace. From Fig. 13.25 (top, left) we observe considerable variability in average claim sizes between the different labor sectors (cyan bars), whereas average leisure claim sizes (gray bars) are similar

<sup>11</sup> https://www.unfallstatistik.ch/.

**Fig. 13.24** (lhs) Empirical density (upper-truncated at 10'000), (rhs) log-log plot of the observed Swiss accident insurance claim amounts

**Fig. 13.25** Average claim amounts split w.r.t. the different feature components (top) Sector, AccQuart, RepDel, (bottom) Age, InjType, InjPart, and split by work and leisure accidents (cyan/gray in the colored version)

across the different labor sectors. Average claim sizes differ considerably between injury types and injured body parts (bottom, middle and right), but they do not differ between work and leisure claims.
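Averages of this kind are a simple group-by computation. A minimal pandas sketch (the column names `BU`, `Sector`, `Claim` and the toy records are hypothetical, since the data set is not public):

```python
import pandas as pd

# hypothetical toy records; this does not reproduce the confidential data
df = pd.DataFrame({
    "BU": ["work", "work", "leisure", "leisure", "leisure"],
    "Sector": ["construction", "office", "office", "construction", "office"],
    "Claim": [4000.0, 800.0, 1500.0, 2500.0, 2000.0],
})

# average claim amount per labor sector, split by work/leisure accident,
# i.e., the quantity shown in the bar plots of Fig. 13.25 (top, left)
avg = df.groupby(["Sector", "BU"])["Claim"].mean().unstack("BU")
```

The resulting table has one row per sector and one column per accident type, which is exactly the layout of the paired cyan/gray bars in the figure.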

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Bibliography**


# **Index**

#### **A**

Absolutely continuous, 3, 4, 15, 20–26, 80, 183, 240, 342
Accumulated local effects (ALE) profile, 362–365, 369
Action rule, 50, 89
Action space, 50, 52, 82, 88, 89, 488
Activation function, 269–274, 278, 283, 295, 304, 341, 356, 383, 385, 390, 391, 393, 404, 409, 537–542, 548
Actuarial valuation, 525
Adam, 287, 288, 297
Adaptive moments, 287
Additive approach, 478–482, 488
Additive effects, 116, 126
Additive form, 30, 35, 38
Additive semi-group, 174
Aggregated statistics, 62
Akaike's information criterion (AIC), 103–106
Algorithmic culture, 2
Amari–Chentsov tensor, 42
Analysis of variance (ANOVA), 141, 148, 149, 245
Approximate sieve estimator, 544, 548, 549
Approximation capacity, 274, 288
ARIMA, 394, 402
a.s., 6
Asymptotically efficient, 72
Asymptotic normality, 69–74, 104, 105, 109, 123–124, 146, 180, 181, 198, 225, 326, 328, 474, 476, 540–546, 548
Asymptotic variance, 72, 474
Attention, 447–451, 497, 498
Attention layer, 447–451
Attention weights, 448, 521
Auto-calibrated forecast, 308–311

Auto-calibration, 308–315, 328, 329, 335, 465, 466
Auto-encoder, 341–356
Automated feature engineering, 267, 293
Auto-resolution, 310
Average estimation loss, 50
Average local effect, 362, 364
Average-pooling, 416

#### **B**

Back-propagation, 280–284, 291, 300, 388, 533
Back-propagation through time (BPTT), 388
Backward elimination, 141, 147, 148, 150, 170, 335
Bagging, 319, 324, 327
Bag-of-part-of-speech, 425
Bag-of-words, 425–429
Balance-corrected, 310–315
Balance property, 64, 65, 122–124, 129, 154, 157, 162, 165, 167, 171, 179, 201, 210, 297, 300, 304–315, 501, 505
Base premium, 116, 140
Batch, 285, 291–293, 304, 321, 515, 517
Batch size, 285, 291, 295, 299, 320, 331
Bayes by Backprop, 533
Bayesian decision rule, 51, 54, 156
Bayesian GLM, 210
Bayesian information criterion (BIC), 105
Bayesian methods, 207–266, 535
Bayesian networks, 530–535
Bayesian parameter estimation, 207–209
Bayes' rule, 208, 233, 235, 430
Bernoulli distribution, 18–19, 87, 163, 304

© The Author(s) 2023 M. V. Wüthrich, M. Merz, *Statistical Foundations of Actuarial Learning and its Applications*, Springer Actuarial, https://doi.org/10.1007/978-3-031-12409-9


#### **C**

Canonical link, 17–19, 21, 23, 27, 30, 31, 52, 63, 78, 114–118, 120–123, 127, 129, 133, 196–198, 264, 279, 306–308, 312, 457


Claim description, 425–428, 432, 440, 442, 446, 570, 572
Claim sizes, 4, 7, 35, 36, 111, 112, 126, 167–180, 188–190, 248–254, 257–259, 453–455, 458–466, 515
Classification and regression trees (CART), 200, 201, 270, 320, 358
Clustering, 239, 342, 436, 438
CN layer, 407, 408, 410, 411, 413, 415, 417, 422
CN network, 273, 394, 407, 411, 412, 415–419, 421–424
CN network encoder, 422
CN operation, 409, 410, 413
Coefficient of variation, 176, 185, 323–325, 332, 492–494
Collinear, 129, 153, 560
Collinearity, 146, 150, 152, 162, 214, 359, 551, 558
Color image, 408
Combined GLM and FN network, 318, 319, 321
Complete information, 232, 233
Complete log-likelihood, 232–233, 235–237, 250, 253–255, 258, 260, 518, 519
Complete probability space, 542, 548
Composite models, 202, 263–265, 454, 483, 484, 491
Composite triplet, 484–488
Compound Poisson (CP), 4, 34, 189–190
Conditional calibration, 310
Conjugate priors, 156, 209
Conjugation, 427
Consistency, 67–69, 73, 89–90, 109, 124, 225, 473, 540–546, 548
Consistent, 67–69, 71, 74, 84, 88–92, 109, 143–147, 204, 226, 473, 475, 478, 482, 484, 546
Consistent loss function, 88, 90–93, 204, 464, 473, 477, 487, 488
Consistent scoring function, 76, 88, 92, 456, 457, 485
Constraint optimization, 212, 222, 223, 280
Context size, 431
Context words, 431, 432
Contingency table, 151, 162, 166, 263, 561
Continuous bag-of-words (CBOW), 430, 432
Continuous features, 130–131, 150
Contrast coding, 128
Convex, 2, 14, 16, 17, 29, 30, 37, 38, 43, 44, 92, 94, 213, 214, 219, 221–222, 286,


Convolution formula, 31, 65
Convolution operator, 410
Convolution property, 174
Co-occurrence matrix, 436
Count random variable, 4, 254
Covariate, 88, 93, 96, 113
Coverage ratio, 108, 481, 482, 490, 491, 498
Cramér–Rao information bound, 56–66, 72, 75, 78, 93, 123
Cross-entropy, 87, 198, 421, 518
Cross-sectional data, 198
Cross-validation, 95–106, 139, 140, 190, 192–195, 211, 215–217
Cube-root transformation, 170
Cumulant function, 14–17, 24, 25, 29, 30, 34, 37–38, 41, 44–46
Curse of dimensionality, 209, 530
Cyclic coordinate descent, 220

#### **D**

Data collection, 1, 151
Data compression, 271
Data modeling culture, 2
Data pre-processing, 1, 2
Decision rule, 50–54, 56–58, 60–62, 67, 75–79, 83–85, 96, 97
Decision-theoretic approach, 88–95
Decision theory, 49–51
Declension, 428
Decoder, 344, 346, 353, 356, 404
Deductible, 248, 249
Deep composite model regression, 266, 483–491
Deep dispersion modeling, 466–472
Deep learning, 267–379, 453–535
Deep network, 204, 273–275, 383, 477, 494
Deep quantile regression, 10, 476–483, 488, 489
Deep RN network, 383, 385, 386
Deep word representation learning, 445–448
Deformation stability, 411
Dense, 538, 539, 542
Density, 4, 7, 8
Depth, 268, 272, 273, 275–277
Derivative operator, 546
Design matrix, 114, 118–122, 128, 145, 177, 306
Deviance estimate, 143
Deviance generalization loss, 79–88
Deviance GL, 82, 84–86, 310, 311
Deviance loss function, 43, 79–82, 84, 93, 279, 280, 284, 462, 468, 475, 478

Deviance residuals, 142, 158, 170, 176, 182, 459, 460, 466, 471
Diagnostic tools, 141, 190–195
Digamma function, 22, 46, 47, 173, 185
Dilation, 414
Dimension reduction, 342, 343, 415, 417, 520
Directional derivative, 367, 368, 372
Discount factor, 526, 527
Discrete, 3, 4, 18, 27
Discrete random variable, 3
Discrete window, 409
Discrimination-free insurance pricing, 361
Dispersion, 13–46, 155, 157, 158, 181–183, 186–189, 465–472, 475, 476
Dispersion parameter, 30
Dispersion submodel, 182–183, 453, 474
Dissimilarity function, 343, 352, 353
Distortion risk measure, 368, 370
Distribution function, 3, 5–9, 13–16, 29
Divergence, 40–47, 55, 92, 94, 308
Divisible, 174
Do-operator, 360
Dot-product attention, 448
Double FN network model, 466, 470, 472
Double generalized linear model (DGLM), 182–190, 247, 453, 466, 515
Drift extrapolation, 394–397, 401, 404
Drop-out layer, 298, 302–304, 377, 419
Duality transformation, 30, 38, 158
Dual parameter space, 17–22, 24, 25, 27, 28, 31–33, 37, 43, 53, 64
Dummy coding, 127–130, 195, 293, 298

#### **E**

Early stopping, 290–293, 299, 303
Educated guess, 50
Effective domain, 14–24, 27, 29, 30, 32, 34, 35
Efficient likelihood estimator (ELE), 72
Eigenvalues, 120, 344, 345
Eigenvectors, 344, 345
Elastic net regularization, 214
Elastic net regularizer, 507
Elicitable, 92, 203, 204, 477, 484, 489
EM algorithm for lower-truncated data, 248–249
EM algorithm for mixture distributions, 230–232
EM algorithm for right-censored data, 251–254
Embedding dimension, 299, 302, 429, 431, 438, 440–442, 444, 446
Embedding layer, 298–302, 429
Embedding map, 128, 294, 298, 399, 429–433, 437, 440, 441, 444, 448

Embedding theorem, 547
Embedding weights, 299, 302, 399, 403, 431, 434, 435, 438, 440, 446
EM forward network, 515, 517
EM network boosting, 515
Empirical bootstrap distribution, 107, 108, 110
Empirical density, 7, 8, 56
Empirical distribution, 7–9, 55, 68, 106
Empirical Wald test, 495, 520
Encoder, 344, 353, 355, 422
Ensembling over selected networks, 335–337
Entropy, 41, 87, 198, 310, 311, 421, 518, 531, 535, 541
Epoch, 285, 292, 297, 301, 319, 320
Estimate, 75–76
Estimation loss, 50
Estimation of conditional expectation, 521–529
Estimation risk function, 83, 84, 86, 327
Estimation theory, 49–74
Estimation variance, 77, 79, 209
Euclidean ball, 213
Euclidean norm, 213, 444
Evidence lower bound (ELBO), 531, 533, 535
Excess-of-loss (XL), 249
Expectation-Maximization (EM) algorithm, 231, 233–236, 238–242, 244, 249, 251, 253, 254, 257, 258, 261, 263–265
Expectation step (E-step), 233–235, 238, 247, 251, 252, 254, 257, 258, 513
Expected deviance generalization loss (GL), 83, 84, 86, 88, 93, 95, 310, 311
Expected generalization loss (GL), 75–79, 83, 88, 310, 325
Expected shortfall, 483–487
Expected value, 4, 62, 77
Experience rating, 141, 199
Explain, 1, 111
Explanatory variable, 8, 9, 111, 113, 130, 338, 568
Exponential activation, 269, 270, 516
Exponential dispersion family (EDF), 13–47
Exponential distribution, 23, 28, 126
Exponential family (EF), 13–47
Exponentially decaying survival function, 38
Exposure, 30, 112, 132
Extreme stable distribution, 36


#### **F**

Feature pre-processing, 148, 293–295, 425–429
Feature space, 113–116
Feed-forward network, 268
Feed-forward neural network, 267, 269–298, 340–342
Filter, 199, 209, 407–415, 417, 419, 421–423
Filter weights, 409–411, 414, 417
Finite first absolute moments of Fourier magnitudes distributions, 545
First moment, 4
Fisher-consistent, 55, 56, 68, 71
Fisher metric, 58
Fisher's contribution, 59
Fisher's information, 42, 58–62, 70–71, 118–122
Fisher's scoring, 192, 216
Fisher's scoring method, 59, 119, 120, 138, 180, 187, 244, 264, 459
Flatten layer, 416, 423, 442
FN layer, 271–273
FN network, 269
Folds, 100, 246, 442
Force of mortality, 347, 405, 525, 526, 529
Forecast, 75–110
Forecast dominance, 93–95, 312–314, 459–461, 464–465
Forecast evaluation, 40, 45, 75–110, 476
Forget gate, 390–392
Forward selection, 141, 147, 148
Friedman's *H*-statistics, 365
Frobenius norm, 345, 347
Full rank, 17, 27, 118–121, 127–129, 177, 181, 232, 294, 306
Functional limit theorem, 546–549
Fundamental domain, 341, 342

#### **G**

Gamma distribution, 22–23, 32, 34, 36, 121, 156, 168, 170
Gamma GLM, 167–176
Gamma model with log-link, 121
Gated recurrent unit (GRU) network, 381, 390, 392–394
Gaussian distribution, 13, 21–27, 36, 43, 70, 84, 126, 212, 240, 399, 498, 501, 526
Gaussian kernel, 8
Gaussian mixture, 240
Gaussian model, 21, 25, 26, 28
Gaussian process, 549, 550
Generalization, 43, 83, 87, 98, 99, 135, 141, 148, 169, 288, 305, 351, 449

Generalization loss (GL), 10, 75–95, 152, 310
Generalization power, 145, 289
Generalized additive decomposition, 496
Generalized additive models (GAMs), 130, 194, 200, 314, 315, 337, 444
Generalized beta of the second kind (GB2), 201, 202, 453
Generalized cross-validation (GCV), 100, 193, 195, 217, 314
Generalized EM (GEM) algorithm, 513
Generalized inverse, 202
Generalized inverse Gaussian distribution, 25–26
Generalized linear models (GLMs), 111
Generalized projection operator, 222, 223, 228, 280, 508
Gibbs sampling, 209
Glivenko–Cantelli theorem, 7, 55, 68, 106, 107
Global balance, 312
Global max-pooling layer, 419
Global properties, 407
Global surrogate model, 358, 359
Global vectors (GloVe), 425, 430, 433, 436, 438, 442–444, 446, 449, 451
Glorot uniform initializer, 284
GPS location data, 418
Gradient descent method, 220, 278–293
Gradient descent update, 221, 279, 285–287
Grouped penalties, 211, 227, 302
Group LASSO generalized projection operator, 228
Group LASSO regularization, 226–229, 508–512
GRU cell, 393
Guaranteed minimum income benefit (GMIB), 525–529

#### **H**

Hadamard product, 287, 391
Hamilton Monte Carlo (HMC) algorithm, 209, 530
Hat matrix, 189–193, 216
Heavy-tailed, 6, 8, 27, 38
Helmert's contrast coding, 128
Hessian, 16, 42, 61, 105, 118, 121, 122, 215, 285
Heterogeneous, 111, 112, 132, 182, 448, 469
Heterogeneous dispersion, 182, 469
Heteroskedastic, 177, 178, 481
Hilbert space, 524, 547
Homogeneity, 111, 200, 456, 467
Homogeneous model, 103, 112, 114
Homoskedastic, 178, 193, 217

Hypothesis testing, 50, 145–147, 181, 549–551

#### **I**

Identifiability, 17, 49, 55, 239, 340–342, 348, 475, 497
Identification function, 481, 489
Identity link, 116, 279, 524
Image classification, 418
Image recognition, 273, 407, 412–413
Imbalanced, 153, 568
Importance measure, 505, 511
Incomplete gamma function, 252, 490
Incomplete information, 232, 233
Incomplete log-likelihood, 236–239, 250, 253, 254, 516, 518
Indirect discrimination, 361
Individual attribution, 374, 375
Individual conditional expectation (ICE), 359–360
Infinitely divisible, 174
Inflectional forms, 428
Information bound, 57, 62–64
Information geometry, 40–47, 81, 145
Initialization, 239, 241, 268, 282, 284, 293, 318, 354, 363, 479, 497, 515
Input gate, 391, 392
Input tensor, 408, 410, 411, 413, 414, 416, 423
In-sample loss, 98, 102, 103
In-sample over-fitting, 102, 279, 288, 525
Interactions, 131, 151, 200, 274, 297, 319, 360, 365, 373, 379, 495, 503, 505
Interaction strength, 365–366
Intercept model, 112, 139, 152, 154, 171, 188, 333
Interior, 15, 29, 44, 62, 112, 235, 476, 547
Inverse Gaussian distribution, 23, 26, 33, 174, 482, 490
Inverse Gaussian GLM, 122, 173–176, 453, 460
Inverse link function, 269
IRLS algorithm, 119, 120, 181, 186, 198
Irreducible risk, 77, 84, 86, 113, 310, 477, 492
Iterative re-weighted least squares algorithm, 119

#### **J**

Jacobian, 46, 47, 61, 181, 474, 535
Jacobian matrix, 70, 535
Joint elicitability, 483–487

#### **K**

Kalman filter, 199
Karush-Kuhn-Tucker (KKT), 212
Kernel size, 408
Kernel smoother, 7
Key, 448–450
*K*-fold cross-validation, 99–101, 139, 141
*k*-th moment, 54
Kullback–Leibler (KL) divergence, 40–45, 55, 56, 69, 81, 82, 87, 104, 105, 145, 237, 238, 456, 530, 531

#### **L**


#### **M**


Markov chain Monte Carlo (MCMC) methods, 10, 209, 210, 530
Martingale sequence forecast, 309
Maximal a posterior (MAP) estimator, 210–212, 225
Maximal cover, 248
Maximization step, 233, 513
Maximum likelihood, 51, 116–122
Maximum likelihood estimation/estimator (MLE), 51, 124–125, 169, 172–174, 181, 186–187, 196–198, 288, 293, 472–476
Max-pooling, 414, 415, 419, 423, 442, 444
Mean, 4
Mean field Gaussian variational family, 534
Mean functional, 84, 90–93, 195, 203, 278, 456
Mean parameter space, 17, 116, 458, 473
Mean squared error of prediction (MSEP), 10, 75–79, 83, 95, 142, 209
Memory rate, 390
Mercer's kernel, 132, 268, 269
M-estimation, 93, 476
M-estimator, 69, 73, 93, 457
Meta model, 329, 330
Method of moments, 54, 55
Method of moments estimator, 54, 55
Method of sieve estimators, 11, 56, 543–546, 549
Metropolis–Hastings (MH) algorithm, 209
Mini-batches, 285, 291, 293, 304, 321, 469, 515, 518
Minimal representation, 16, 17, 30, 31, 42, 63, 67
Minimax decision rule, 50–51
MinMaxScaler, 294, 295, 371
Mixed Poisson distribution, 20, 155, 164
Mixture density, 230, 233, 242, 243, 454
Mixture density networks (MDNs), 233, 453, 513, 515–520
Mixture distribution, 163, 164, 230–235, 238–247, 259, 513
Mixture probability, 230, 231, 235, 236, 241, 243, 245–247, 513
Model-agnostic tools, 357–376, 495
Model class, 2, 105, 454
Modeling cycle, 1–3
Model misspecification, 2, 305, 472
Model uncertainty, 56, 69, 82, 93, 453–476, 492–495, 530
Model validation, 2, 141–180, 357
Modified Bessel function, 26
Moment generating function, 5, 6, 9, 15, 16, 30, 31, 35, 38, 125, 168, 174, 201

Momentum-based gradient descent method, 280, 285–287
Momentum coefficient, 285, 287
Mortality, 3, 347–351, 354–356, 394–406, 422–424, 525, 526, 529
Mortality surface, 347, 349, 350, 355, 356, 394, 422–424
Motor third party liability (MTPL), 133
MSEP optimal predictor, 78
M-step, 233–235, 238, 239, 244, 251, 252, 256, 257, 513, 515, 517
Multi-class cross-entropy, 87, 421
Multi-dimensional array, 408
Multi-dimensional Cramér-Rao bound, 60, 62
Multi-index, 546
Multi-output network, 462, 463, 466, 479
Multiple outputs, 461, 462, 468, 488
Multiple quantiles, 478–479
Multiplicative approach, 479–482
Multiplicative effects, 116, 126
Multiplicative model, 128, 131

#### **N**

Nadam, 287, 288
Nagging, 320, 324–326
Nagging predictor, 324–329
Natural language processing (NLP), 10, 298, 425–451
NB1, 158
NB2, 157, 158
Negative-binomial distribution, 19, 156, 159
Negative-binomial GLM, 159, 160, 166
Negative-binomial model, 20, 156, 158–160, 163
Negative expected Hessian, 105, 118, 121, 215
Negative sampling, 431–436, 438, 440, 446, 450
Nested GLM, 145
Nested simulation, 522
Nesterov-accelerated version, 286
Network aggregating, 325
Network ensembling, 10, 492
Network output, 387–388
Network parameter, 272, 274
Network weight, 271, 284
Neurons, 269
New representation, 46, 132, 268
Newton–Raphson algorithm, 59, 120, 231
NLP pipeline, 425
Noisy part, 113, 288, 290
Nominal categorical feature, 127
Nominal outcome, 4, 127, 130, 195, 302, 364, 555

Non-linear activation, 269
Non-linear generalizations of PCA, 351
Non-monotone feature, 150–153, 178
Non-parametric bootstrap, 106–109, 492
Non-trainable, 318, 441, 450
Normalization layer, 273, 298, 303, 304, 520
Nuisance parameter, 19–22, 28, 119, 143, 157–161, 169, 175, 177–180, 182, 186, 202
Nuisance parameter estimation, 158–162
Null model, 241

#### **O**

Objective function, 2, 81, 97, 124, 220
Observation scale, 126, 167, 178
Observed information matrix, 120–122
Offset, 131–133, 139, 153, 165, 262, 264, 303, 313, 318, 515
One-hot encoding, 128, 232, 244, 292–300, 308, 428, 499, 500
One-period ahead forecast, 399
Oracle property, 225–226
Ordinal categorical feature, 127, 168
Orthogonal projection, 125, 145, 523
Orthonormal basis, 344, 345, 353
Out-of-sample loss, 95–98
Output gate, 391, 392
Output mapping, 461
Over-dispersed Poisson model, 180
Over-dispersion, 20, 32, 143, 146, 151, 155–162, 173, 337–340
Over-fitting, 102, 113, 133, 194, 210, 273, 279, 288–290, 293, 303, 307
Over-parametrized, 145, 403
Over-sampling, 153–155

#### **P**

Padding, 413, 415, 427, 428, 440
Panel data, 198, 381
Parameter estimation, 15, 17, 37, 51–56
Parameter estimation under lower-truncation, 254–264
Parameter estimation under right-censoring, 248–249
Parameter set, 49, 174
Parametric bootstrap, 109–110
Pareto distribution, 23, 27, 39, 40, 246
Parsimonious, 145, 152, 464, 470
Partial dependence plot (PDP), 360–366
Partial dependence profile, 360, 361
Partial derivative, 92, 123, 225, 362, 367, 404, 549, 551

Particle filters, 209
Part-of-speech (POS), 428
Pathwise cyclic coordinate descent, 220
Pearson's chi-square statistics, 82, 142
Pearson's estimate, 172, 181, 333
Pearson's residuals, 142
Phase type distribution, 201
Pinball loss, 477, 478, 481, 485
Pinball loss function, 202–204
Plain vanilla gradient descent, 278–280, 283, 285, 290, 532
Poisson distribution, 19, 32, 133, 155, 163, 259, 292
Poisson GLM, 133–134
Poisson unit deviances, 45, 144
Polya distribution, 19
Pooling layer, 415–416
Positive stable distribution, 36, 454
Posterior density, 208–210, 530, 531
Posterior distribution, 53, 54, 209, 237, 513
Posterior information, 140, 141
Posterior log-likelihood, 210
Power variance function, 35, 36, 86, 87, 181, 454, 467
Power variance parameter, 34, 36, 86, 94, 121, 122, 454–456, 459–462, 464, 466, 468–471, 487, 492
Predefined gradient descent methods, 287
Predict, 76, 78, 79
Predicting *vs.* explaining, 111
Prediction, 76–79, 89, 310, 476, 494
Predictive modeling, 43, 75–110
Predictor, 2, 75, 114, 309, 310, 320, 325
Prefixes, 428
Pre-processed features, 123, 169, 385, 387
Pre-trained word embeddings, 425, 430, 436, 438
Principal components analysis (PCA), 342, 344, 346–348
Prior density, 530
Prior distribution, 51, 53, 207, 210, 212, 342, 534
Prior information, 140, 141
Probability distortion, 367
Probability space, 3, 55, 309, 382, 521, 542, 548
Probability weight, 3, 4, 18–20, 66, 163, 259
Process variance, 77, 79, 84, 209
Projected gradient descent, 221
Proper scoring rule, 76, 88, 90, 91
Protected characteristics, 361
Proximal gradient descent algorithm, 223, 227
Proximity, 129, 298, 302, 429

Partial residuals, 153


Pseudo maximum likelihood estimator (PMLE), 180, 473–475
Pseudo-norm, 542, 543
*p*-value, 146, 147, 151, 245

#### **Q**

QQ plot, 172, 242, 245, 333, 518
Quantile, 92, 203, 368, 476
Quantile level, 374, 375
Quantile regression, 10, 202–204, 483, 488
Quantile risk measure, 368
Quasi-generalized pseudo maximum likelihood estimator (QPMLE), 180, 475
Quasi-likelihood, 180–181
Quasi-Newton method, 120, 513
Quasi-Poisson model, 180
Quasi-Tweedie's model, 181
Query, 448, 449

#### **R**

Random component, 498, 501, 505, 521
Random effects, 198–199
Random forest, 200, 319
Random variable, 3–7
Random vector, 3, 109, 180, 255, 371, 523, 526
Random walk, 350, 394–397, 401, 402, 404
Rank, 17, 118–121, 127, 177, 232, 344, 348, 354
Raw mortality data, 347, 394, 406, 422
Reconstruction error, 342, 344, 346, 350, 352, 354
Rectified linear unit activation, 269, 270, 274, 275, 479
Recurrent neural network, 381–406
Red-green-blue (RGB), 289
Reference point, 371–374
Regression attention, 496–498, 502, 503, 505, 507, 511, 520, 521
Regression function, 113, 267, 269, 485, 488, 512
Regression modeling, 88, 112–113
Regression parameter, 88, 114, 116, 119, 122, 131–133, 180, 189, 192, 208, 210, 271, 486, 496, 514
Regression trees, 133, 200, 307, 330, 359
Regularization, 207–268, 306–308, 314, 464, 507–509
Regularization through early stopping, 290–293
Regularly varying, 6, 8, 16, 23, 38, 39, 126, 167, 202, 241, 246

ReLU activation, 269, 270, 274, 275
Reparametrization trick, 532–534
Representation learning, 267–269, 273, 274, 293, 402, 411, 415, 445–448, 453, 461, 462
Reproductive dispersion model, 44
Reproductive form, 30, 38, 133, 158, 169, 174, 187, 459, 467
Resampling distribution, 107
Reset gate, 393
Residual bootstrap, 108
Residual maximum likelihood estimation, 186–187
Residuals, 108, 117, 141–145, 153, 158, 176, 178, 182, 191, 197, 315
Retention level, 249
Ridge regression, 214–217, 304
Ridge regularization, 212–214, 217, 224, 303, 507
Right-censored gamma claim sizes, 250–252
Right-censoring, 250–254
Right-singular matrix, 345, 348
Risk function, 50, 52, 61, 77, 83, 84, 86
rmsprop, 287, 297
RN layer, 383–391, 411
RN network, 273, 381, 383, 385–390, 394–406, 412, 442, 445, 448
Robustified representation learning, 461–464, 468
Robust statistics, 56
Root mean square propagation, 287

#### **S**

Saddlepoint approximation, 36, 47, 145, 183–187, 456, 459, 466, 467, 517
Saturated model, 80, 81, 113, 115, 157, 158, 166, 354
Scalar product, 114, 130, 200, 218, 269, 304, 312, 386, 433, 434, 444, 524
Scaled deviance GL, 88
Scale parameter, 22–24, 33, 34, 38, 168, 172, 201, 241, 252, 258, 517
Schwarz' information criterion (SIC), 106
Score, 58, 61, 71, 90, 117, 121, 180, 198, 218, 234, 457
Score equations, 71, 73, 117–120, 180, 196–198, 203, 215, 218, 236, 238, 304, 459
Scoring function, 69, 76–79, 81, 88–90, 456, 484, 485, 517, 518
2nd order marginal attribution, 377
Self-attention mechanism, 449, 450
Sequence of sieves, 542


#### **T**



*t*-statistics, 147
Tukey–Anscombe plot, 172, 178, 190, 459, 464–466, 471
Tweedie's CP model, 34–36, 182, 183, 190, 325, 327, 454
Tweedie's distribution, 34–36, 87, 182, 457, 458, 487
Tweedie's family, 187, 454–458, 481, 487
Tweedie's forecast dominance, 95, 327, 453, 459
Tweedie's model with log-link, 121
Two modeling cultures, 329

#### **U**

Unbiased, 57, 60–64, 66, 70, 77, 78, 85, 86, 93, 122, 123, 126, 309, 458
Unbiased estimator, 56–66, 85, 123, 458
Under-dispersion, 33, 264
Under-sampling, 153–155
Unfolded representation, 384, 386
Uniformly dense on compacta, 538, 539
Uniformly minimum variance unbiased (UMVU), 56, 57, 61, 63–67, 79, 93, 123
Unit deviance, 42–47, 80–83, 86–88, 90–95, 124–125, 141, 142, 144, 145, 183, 184, 311, 319, 454–456
Unit simplex, 27, 230, 232, 235
Universality theorem, 10, 11, 274–278, 288, 537–540, 550
Unsupervised learning, 342, 436, 445
Update gate, 393

#### **V**

VA account, 525, 527, 529
Validation analysis, 290
Validation data, 291–293
Validation loss, 291, 292, 297, 301, 331, 421
Value-at-risk (VaR), 370, 528, 529
Vapnik–Chervonenkis (VC) dimension, 541
Variable annuity (VA), 525
Variable permutation importance (VPI), 357–359, 372, 377
Variable selection, 148, 214, 225, 357, 495, 497–499, 507–509, 549

Variance function, 31–36, 44, 82, 86, 87, 117, 142, 158, 174, 180, 183, 185, 186, 209, 280, 454, 467, 474
Variance reduction technique, 327
Variational distributions, 530, 535
Variational inference, 535
Variational lower bound, 531
VC-class, 541
Vector-valued canonical parameter, 15
Vocabulary, 426, 428, 433, 434, 438, 440
Volume, 29, 129, 134, 145, 168, 184, 234

#### **W**

Wald statistics, 146, 147, 229
Wald test, 139, 146, 148, 198, 224, 245, 357, 495, 503, 521, 549
Weak derivative, 547
Weight, 3, 4, 18–20, 29, 65, 66
Weighted square loss function, 82, 312, 313, 438
Weight matrix, 310, 449
Wild bootstrap, 108
Window size, 407–409, 412, 413, 415, 430, 431, 434
Word embedding, 269, 425, 429–439, 450
Word to vector (word2vec), 425, 430, 433, 438–440, 444, 446, 450
Word2vec algorithm, 430–436
Working residuals, 117, 186, 191, 197, 198, 244
Working weight matrix, 116, 121, 186, 191

#### **X**

XGBoost, 201

#### **Z**

Zero-inflated Poisson (ZIP), 163, 164, 166, 167, 259, 261, 263, 337
Zero-truncated claim counts, 259
Zero-truncated Poisson (ZTP), 164, 259
Z-estimator, 73, 118, 180, 457
Z-statistics, 147
ZTP log-likelihood, 262, 263
ZTP model, 155